Hello, and sorry for the late answer. Thank you for yours.
2013/6/3 Rupert Westenthaler <[email protected]>

> Hi Joseph
>
> On Mon, Jun 3, 2013 at 3:43 PM, Joseph M'Bimbi-Bene
> <[email protected]> wrote:
> [..]
> > Now, the logs of the processing of the token "La"
> >
> > ProcessingState > 0: Token: [1087, 1089] La (pos:[Value [pos:
> > ADJ(olia:Adjective)].prob=0.016871281997002517]) chunk: 'none'
> > ProcessingState - TokenData: 'La'[linkable=true(linkabkePos=null)|
> > matchable=true(matchablePos=null)| alpha=true| seachLength=true|
> > upperCase=true]
>
> The reason why the 'La' of the last sentence of your document is
> marked as 'linkable' is the combination of the following things:
>
> 1. the POS tag has a very low probability (0.017) and is therefore
> ignored, as the configured minimum probability is higher than that.

Actually, I set both parameters "prop" and "pprob" to 0.01, so I don't think I made a mistake there, did I?

You mentioned in a previous mail something about a strange tokenizing behaviour; it might be the source of a new problem. Here is, for example, a log excerpt from the Stanbol web console for an integration test. I isolated the pathological case:

org.apache.stanbol.enhancer.servicesapi.ChainException: Enhancement Chain
failed because of required Engine 'talismane-nlp' failed with Message:
Unable to process ContentItem
'<urn:content-item-sha1-27bdb282be8f827392a55c8cd8d0ee5c740e247a>' with
Enhancement Engine 'talismane-nlp' because the engine was unable to process
the content (Engine class:
org.apache.stanbol.enhancer.engines.restful.nlp.impl.RestfulNlpAnalysisEngine)
(Reason: 'RestfulNlpAnalysisEngine' failed to process content item
'urn:content-item-sha1-27bdb282be8f827392a55c8cd8d0ee5c740e247a' with type
'text/plain': Exception while executing Request on RESTful NLP Analysis
Service at http://localhost:9101/analysis)!
    at org.apache.stanbol.enhancer.jobmanager.event.impl.EventJobManagerImpl.enhanceContent(EventJobManagerImpl.java:179)
    [...]
Caused by: org.apache.stanbol.enhancer.servicesapi.EngineException:
'RestfulNlpAnalysisEngine' failed to process content item
'urn:content-item-sha1-27bdb282be8f827392a55c8cd8d0ee5c740e247a' with type
'text/plain': Exception while executing Request on RESTful NLP Analysis
Service at http://localhost:9101/analysis
    at org.apache.stanbol.enhancer.engines.restful.nlp.impl.RestfulNlpAnalysisEngine.computeEnhancements(RestfulNlpAnalysisEngine.java:285)
    at org.apache.stanbol.enhancer.jobmanager.event.impl.EnhancementJobHandler.processEvent(EnhancementJobHandler.java:271)
    [...]
Caused by: org.apache.http.client.HttpResponseException: Internal Server Error
    at org.apache.stanbol.enhancer.engines.restful.nlp.impl.RestfulNlpAnalysisEngine$AnalysisResponseHandler.handleResponse(RestfulNlpAnalysisEngine.java:367)
    at org.apache.stanbol.enhancer.engines.restful.nlp.impl.RestfulNlpAnalysisEngine$AnalysisResponseHandler.handleResponse(RestfulNlpAnalysisEngine.java:341)

And when I curl the text to Talismane, I get the following message:

16:49:21,166 [main] INFO server.Main - ... starting server
16:53:55,560 [btpool0-2] ERROR resource.AnalysisResource - Exception while analysing Blob
java.lang.IllegalArgumentException: Illegal span [2199,2201] for Token
relative to Text: [0, 2200]: Span of the contained Token MUST NOT extend the
others!
    at org.apache.stanbol.enhancer.nlp.model.impl.SpanImpl.<init>(SpanImpl.java:78)
    at org.apache.stanbol.enhancer.nlp.model.impl.TokenImpl.<init>(TokenImpl.java:33)
    at org.apache.stanbol.enhancer.nlp.model.impl.SectionImpl.addToken(SectionImpl.java:146)
    at at.salzburgresearch.stanbol.enhancer.nlp.talismane.analyser.TalismaneAnalyzer.processSentence(TalismaneAnalyzer.java:329)

I will send you the text in private so you can try to reproduce the bug, since I don't want it to be online.

EDIT: I tried again just now; I get the same message in the Stanbol console, but everything is fine when curling Talismane ...
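A side note on that IllegalArgumentException: it looks like Stanbol's span sanity check firing because Talismane reported a token ending at offset 2201 in a text whose span is [0, 2200], i.e. an off-by-one at the very end of the text. A minimal sketch of that invariant (illustrative names, not the actual Stanbol source):

```java
public class SpanCheck {

    /**
     * A token span must lie entirely inside the span of its enclosing text;
     * otherwise the model rejects it, as seen in the Talismane log above.
     */
    static void checkTokenSpan(int textStart, int textEnd,
                               int tokenStart, int tokenEnd) {
        if (tokenStart < textStart || tokenEnd > textEnd) {
            throw new IllegalArgumentException(
                "Illegal span [" + tokenStart + "," + tokenEnd
                + "] for Token relative to Text: [" + textStart + ", "
                + textEnd + "]");
        }
    }

    public static void main(String[] args) {
        checkTokenSpan(0, 2200, 2195, 2199); // a valid final token
        // the failing case from the log: token end 2201 > text end 2200
        try {
            checkTokenSpan(0, 2200, 2199, 2201);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

One possible reading (an assumption, not something the logs prove) is that the tokenizer counts one character, e.g. a trailing newline or a normalized character, that the blob does not; that could also explain why the problem does not show up when curling Talismane directly.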
I also tried the "dbpedia-disambiguation" enhancement chain with the French OpenNLP model published by Mr. Hernandez that I talked to you about some time ago (as a reminder, here is the link). It seems quite OK; no entities spotted, but

> 2. Proper Noun Linking (enhancer.engines.linking.properNounsState) is
> deactivated
> 3. UpperCase linking of Tokens without POS tag
> (enhancer.engines.linking.linkOnlyUpperCaseTokensWithMissingPosTag) is
> also deactivated, as the default is the same as the value for Proper
> Noun Linking (see STANBOL-1049).
> 4. The Min Search Token Length
> (enhancer.engines.linking.minSearchTokenLength) is set to two.
>
> As there is no POS tag and UpperCase linking of Tokens without POS tag
> is deactivated, the Min Search Token Length is the only criterion used
> to classify the Token. As 'La' has >= two chars, it is therefore
> classified as a 'linkable' token. Whether the token is upper/lower case
> and/or at the beginning of a sentence is of no importance in this
> specific case.
>
> This might seem strange in the given context, but in situations where
> there is no POS tagging support for the language of the parsed text
> this behavior is completely fine and important. Otherwise it would not
> be possible to link entities mentioned as the first word of a sentence.
>
> In your specific case, explicitly setting
> 'enhancer.engines.linking.linkOnlyUpperCaseTokensWithMissingPosTag=true'
> would prevent 'La' from being classified as 'linkable'. But the root of
> the problem is Talismane failing to detect the POS tag for this token.
>
> Situations like that do however suggest investigating whether
> EntityLinking should use different fallback strategies for
>
> * linking texts without POS tags - where no POS tagger is available
> for the language of the text
> * linking tokens with missing POS tag - where a POS tagger is present,
> but it was not able to classify a token
>
> best
> Rupert
>
> > [...]
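Rupert's points boil down to a small decision function. The following is only a hedged sketch with made-up names, simplified to the cases discussed in this thread (how "upper case" is decided for a sentence-initial token like 'La' is a subtlety not modelled here); the real logic lives in the EntityLinking engine and covers more cases:

```java
public class LinkableSketch {

    /**
     * Simplified classification of a token as 'linkable'.
     * posTagUsable means a POS tag exists AND its probability reaches the
     * configured minimum (for 'La', ADJ at prob 0.017 is below it -> false).
     */
    static boolean isLinkable(boolean posTagUsable, boolean isProperNoun,
                              boolean upperCaseToken,
                              boolean linkOnlyUpperCaseWithMissingPosTag,
                              int tokenLength, int minSearchTokenLength) {
        if (posTagUsable) {
            // with a trusted tag, the tag class decides (simplified)
            return isProperNoun;
        }
        // no trusted POS tag: optionally require an upper-case token
        if (linkOnlyUpperCaseWithMissingPosTag && !upperCaseToken) {
            return false;
        }
        // otherwise the minimum search token length is the only criterion
        return tokenLength >= minSearchTokenLength;
    }

    public static void main(String[] args) {
        // 'La': tag ignored (prob too low), upper-case requirement off,
        // length 2 >= minSearchTokenLength 2 -> linkable
        System.out.println(isLinkable(false, false, true, false, 2, 2)); // prints true
    }
}
```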
> > EntityLinker --- preocess Token 0: La (lemma: null) linkable=true,
> > matchable=true | chunk: none
> > EntityLinker + 1:'recherche' (lemma: null) linkable=true, matchable=true
> > EntityLinker >> searchStrings [La, recherche]
> > EntityLinker - found 1 entities ...
> > EntityLinker > http://www.edf.fr/EdfAcronyme.owl#LA (ranking: null)
> > abelTokenizer for language null
> > abelTokenizer Language null not configured to be supported
> > abelTokenizer for language null
> > abelTokenizer Language null not configured to be supported
> > MainLabelTokenizer > use Tokenizer class
> > org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.opennlp.OpenNlpLabelTokenizer
> > for language null
> > MainLabelTokenizer - tokenized la -> [la]
> > EntityLinker + LA[m=FULL,s=1,c=1(1.0)/1] score=1.0[l=1.0,t=1.0] for
> > http://www.edf.fr/EdfAcronyme.owl#LA ranking: null
> > EntityLinker >> Suggestions:
> > EntityLinker - 0: LA[m=FULL,s=1,c=1(1.0)/1] score=1.0[l=1.0,t=1.0] for
> > http://www.edf.fr/EdfAcronyme.owl#LA ranking: null
> >
> > So, same as before. Is OpenNLP working well together with Talismane? I
> > saw that the ranking of the sentence detection engine was lower than the
> > ranking of Talismane and the linking engine (-100 vs 0), since the
> > documentation of the engine says:
> > "Language (required): The language of the text needs to be available. It
> > is read as specified by STANBOL-613
> > <https://issues.apache.org/jira/browse/STANBOL-613> from the metadata of
> > the ContentItem. Effectively this means that any Stanbol Language
> > Detection engine will need to be executed *before the OpenNLP POS
> > Tagging Engine*", which is Talismane in my case.
> > The logs are exactly the same, but just for the sake of it (or in case I
> > missed something), I will copy them:
> >
> > OpenNlpSentenceDetectionEngine > add Sentence: [249, 431]
> > OpenNlpSentenceDetectionEngine > add Sentence: [432, 513]
> > OpenNlpSentenceDetectionEngine > add Sentence: [514, 552]
> > OpenNlpSentenceDetectionEngine > add Sentence: [554, 780]
> > OpenNlpSentenceDetectionEngine > add Sentence: [781, 971]
> > OpenNlpSentenceDetectionEngine > add Sentence: [972, 1085]
> > OpenNlpSentenceDetectionEngine > add Sentence: [1087, 1264]
> >
> > ProcessingState > 0: Token: [1087, 1089] La (pos:[Value [pos:
> > ADJ(olia:Adjective)].prob=0.016871281997002517]) chunk: 'none'
> > ProcessingState - TokenData: 'La'[linkable=true(linkabkePos=null)|
> > matchable=true(matchablePos=null)| alpha=true| seachLength=true|
> > upperCase=true]
> >
> > EntityLinker --- preocess Token 0: La (lemma: null) linkable=true,
> > matchable=true | chunk: none
> > EntityLinker + 1:'recherche' (lemma: null) linkable=true, matchable=true
> > EntityLinker >> searchStrings [La, recherche]
> > EntityLinker - found 1 entities ...
> > EntityLinker > http://www.edf.fr/EdfAcronyme.owl#LA (ranking: null)
> >
> > MainLabelTokenizer > use Tokenizer class
> > org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
> > for language null
> > 03.06.2013 15:41:53.188 *TRACE* [Thread-5674]
> > org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
> > Language null not configured to be supported
> > MainLabelTokenizer > use Tokenizer class
> > org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
> > for language null
> > 03.06.2013 15:41:53.188 *TRACE* [Thread-5674]
> > org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
> > Language null not configured to be supported
> > MainLabelTokenizer > use Tokenizer class
> > org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.opennlp.OpenNlpLabelTokenizer
> > for language null
> > MainLabelTokenizer - tokenized la -> [la]
> > EntityLinker + LA[m=FULL,s=1,c=1(1.0)/1] score=1.0[l=1.0,t=1.0] for
> > http://www.edf.fr/EdfAcronyme.owl#LA ranking: null
> > EntityLinker >> Suggestions:
> > EntityLinker - 0: LA[m=FULL,s=1,c=1(1.0)/1] score=1.0[l=1.0,t=1.0] for
> > http://www.edf.fr/EdfAcronyme.owl#LA ranking: null
>
> >> Can you send the text sample you used, so that I can check why
> >> Talismane fails to correctly split the sentences.
> >> best
> >> Rupert
> >>
> >> > Here are some log excerpts:
> >> >
> >> > On the token 'La', which is (I think) a determiner; anyway, definitely
> >> > not a noun:
> >> >
> >> > ProcessingState > *15: Token: [1087, 1089] La* (pos:[Value [pos: *
> >> > ADJ(olia:Adjective)].prob=0.016871281997002517*]) chunk: 'none'
> >> > ProcessingState - TokenData: 'La'[linkable=true(linkabkePos=null)|
> >> > matchable=true(matchablePos=null)| alpha=true| seachLength=true|
> >> > upperCase=true]
> >> >
> >> > EntityLinker --- *preocess Token 15: La* (lemma: null) linkable=true,
> >> > matchable=true | chunk: none
> >>
> >> Here it says that La is the 15th token of the sentence. This is the
> >> reason why it is marked as linkable.
>
> > OK, I think I understand ... but if I got it right, then after
> > lowercasing it, the token should be linked/linkable too. But it is not;
> > search for "<look here>" for the part of the message related to it.
>
> >> > EntityLinker + 14:'cognitives.' (lemma: null) linkable=true,
> >> > matchable=true
> >> > EntityLinker + 16:'recherche' (lemma: null) linkable=true,
> >> > matchable=true
> >> > EntityLinker >> searchStrings [La, recherche]
> >> > EntityLinker - found 1 entities ...
> >> > EntityLinker > http://www.edf.fr/EdfAcronyme.owl#LA (ranking: null)
> >> >
> >> > MainLabelTokenizer > use Tokenizer class
> >> > org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
> >> > for language null
> >> > 03.06.2013 11:11:30.809 *TRACE* [Thread-5419]
> >> > org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
> >> > Language null not configured to be supported
> >> > MainLabelTokenizer > use Tokenizer class
> >> > org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
> >> > for language null
> >> > 03.06.2013 11:11:30.809 *TRACE* [Thread-5419]
> >> > org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
> >> > Language null not configured to be supported
> >> > MainLabelTokenizer > use Tokenizer class
> >> > org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.opennlp.OpenNlpLabelTokenizer
> >> > for language null
> >> > MainLabelTokenizer - tokenized la -> [la]
> >> >
> >> > EntityLinker + LA[m=FULL,s=1,c=1(1.0)/1] score=1.0[l=1.0,t=1.0] for
> >> > http://www.edf.fr/EdfAcronyme.owl#LA ranking: null
> >> > EntityLinker >> Suggestions:
> >> > EntityLinker - 0: LA[m=FULL,s=1,c=1(1.0)/1] score=1.0[l=1.0,t=1.0] for
> >> > http://www.edf.fr/EdfAcronyme.owl#LA ranking: null
> >> >
> >> > Then I went to the page of JIRA issue STANBOL-1049 and guessed that my
> >> > token corresponded to the "unknown POS tag rule".
> >> > "TextProcessingConfig#linkOnlyUpperCaseTokensWithMissingPosTag" ->
> >> > does this have anything to do with the *Upper Case Token Mode*
> >> > parameter?
> >> > Since my tokens 'La' are always at the beginning of the sentence, I
> >> > guessed they fell into the category:
> >> > "else - lower case token or sentence or sub-sentence start
> >> > * tokens equal to or longer than
> >> > TextProcessingConfig#minSearchTokenLength are marked as matchable"
> >> >
> >> > I don't understand that rule: is it supposed to override the *Upper
> >> > Case Token Mode* parameter? Anyway, I tried with all 'La' lowercased,
> >> > i.e. to 'la', and the tokens 'la' are never processed. Here is the
> >> > log excerpt:
> >> >
> >> > ProcessingState > *15: Token: [1087, 1089] la* (pos:[Value [pos:
> >> > DET(olia:Determiner|olia:PronounOrDeterminer)].prob=0.9445673708042409])
> >> > chunk: 'none'
>
> > <look here>
>
> >> > ProcessingState - TokenData: 'la'[linkable=false(*linkabkePos=false*)|
> >> > matchable=false(*matchablePos=false*)| alpha=true| seachLength=true|
> >> > upperCase=false]
> >> >
> >> > After a few minutes of reflection, I see that linkabkePos and
> >> > matchablePos are no longer equal to "null". What is the rule that sets
> >> > them to null or not? It is strange that just an upper-case letter can
> >> > change the POS tag of the token that drastically for Talismane, but I
> >> > cannot do anything about that. I still have the question about the
> >> > supposed overriding of the *Upper Case Token Mode* parameter by the
> >> > "unknown POS tag rule".
> >> >
> >> > On a related topic, the *Upper Case Token Mode* parameter doesn't seem
> >> > to behave properly (or I missed something). I left "uc=NONE" in the
> >> > config of the engine and monitored the processing of the token; here
> >> > are the logs. On the token "utilisée" for the text: "AE est une mesure
> >> > couramment utilisée."
> >> > ProcessingState > 5: Token: [543, 551] utilisée (pos:[Value [pos:
> >> > VPP(olia:PastParticiple|olia:Verb)].prob=0.9864354941576942]) chunk:
> >> > 'none'
> >> > ProcessingState - TokenData:
> >> > 'utilisée'[linkable=false(linkabkePos=false)|
> >> > matchable=false(matchablePos=false)| alpha=true| seachLength=true|
> >> > upperCase=false]
> >> >
> >> > The token is not processed, which I am fine with, since its POS tag is
> >> > VPP.
> >> >
> >> > Now, on the token "Utilisée" for the text: "AE est une mesure
> >> > couramment Utilisée."
> >> >
> >> > ProcessingState > 5: Token: [543, 551] Utilisée (pos:[Value [*pos:
> >> > NPP*(olia:ProperNoun|olia:Noun)].*prob=0.19181597467804898*]) chunk:
> >> > 'none'
> >> > ProcessingState - TokenData:
> >> > 'Utilisée'[linkable=true(linkabkePos=null)|
> >> > matchable=true(matchablePos=null)| alpha=true| seachLength=true|
> >> > upperCase=true]
> >> >
> >> > So the POS tag is OK, but the probability doesn't reach the threshold
> >> > (which I set to 0.55). Here is the log of the processing of the token:
> >> >
> >> > EntityLinker --- preocess Token 5: Utilisée (lemma: null)
> >> > linkable=true, matchable=true | chunk: none
> >> > EntityLinker - 4:'couramment' (lemma: null) linkable=false,
> >> > matchable=false
> >> > EntityLinker - 6:'.' (lemma: null) linkable=false, matchable=false
> >> > EntityLinker + 3:'mesure' (lemma: null) linkable=true, matchable=true
> >> > EntityLinker >> searchStrings [mesure, Utilisée]
> >> >
> >> > Is this a problem of POS tag processing or of UpperCase linking, or
> >> > did I misunderstand something?
> >> >
> >> > Thank you for the time you spend helping us users; it is very much
> >> > appreciated.
> >> > Best regards, Joseph
> >> >
> >> > 2013/6/3 Rupert Westenthaler <[email protected]>
> >> >
> >> >> Hi Joseph
> >> >>
> >> >> On Mon, Jun 3, 2013 at 10:01 AM, Joseph M'Bimbi-Bene
> >> >> <[email protected]> wrote:
> >> >> > I think it is the tokenizing process of Talismane NLP, since my
> >> >> > enhancement chain is:
> >> >> > - langdetect
> >> >> > - talismaneNLP
> >> >> > - MyVocabulary
> >> >>
> >> >> I also used Talismane when testing and I was not seeing tokens like
> >> >> that.
> >> >>
> >> >> Here is an excerpt of my log (with minSearchTokenLength set to 2):
> >> >>
> >> >> --- preocess Token 11: AE (lemma: null) linkable=true, matchable=true
> >> >> | chunk: none
> >> >> - 10:'*' (lemma: null) linkable=false, matchable=false
> >> >> - 12:'*' (lemma: null) linkable=false, matchable=false
> >> >> - 9:'indiquant' (lemma: null) linkable=false, matchable=false
> >> >> - 13:'une' (lemma: null) linkable=false, matchable=false
> >> >> - 8:')' (lemma: null) linkable=false, matchable=false
> >> >> + 14:'servitude' (lemma: null) linkable=false, matchable=true
> >> >> >> searchStrings [AE, servitude]
> >> >>
> >> >> best
> >> >> Rupert
> >> >>
> >> >> --
> >> >> | Rupert Westenthaler             [email protected]
> >> >> | Bodenlehenstraße 11             ++43-699-11108907
> >> >> | A-5500 Bischofshofen
>
> --
> | Rupert Westenthaler             [email protected]
> | Bodenlehenstraße 11             ++43-699-11108907
> | A-5500 Bischofshofen
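P.S. The "utilisée" / "Utilisée" runs quoted above seem to reduce to a probability gate on the POS annotation. A hedged sketch of that assumed behavior (illustrative names, not the Stanbol API): a tag below the configured minimum probability is ignored, so the token is handled as if it had no POS tag at all and the missing-POS-tag fallback, rather than the Upper Case Token Mode, decides.

```java
public class PosProbGate {

    /** Returns the tag if it is trustworthy, null if it must be ignored. */
    static String usableTag(String tag, double prob, double minProb) {
        return prob >= minProb ? tag : null;
    }

    public static void main(String[] args) {
        double minProb = 0.55; // the threshold configured in the thread
        // "utilisée": VPP at 0.986 -> tag is used, a verb -> not linked
        System.out.println(usableTag("VPP", 0.9864354941576942, minProb)); // prints VPP
        // "Utilisée": NPP at 0.192 -> tag is ignored -> missing-POS fallback
        System.out.println(usableTag("NPP", 0.19181597467804898, minProb)); // prints null
    }
}
```

Under this reading, "Utilisée" with its low-probability NPP tag is classified by the missing-POS-tag rules, which may explain why uc=NONE appeared to have no effect.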
