Hi Andrea
The CELI Lemmatizer engine (see STANBOL-583) does exactly that. It
creates TextAnnotations for each word and adds the POS and Lemma (if
you enable the "Full Morphological Analysis" in its configuration).
Here is an example for "Tagen" from the German sentence "An Tagen wie
diesen würde man lieber baden gehen!":
<urn:enhancement-3bf15662-f87a-dcba-e4cb-92024b167d30>
    a <http://fise.iks-project.eu/ontology/TextAnnotation> ,
      <http://fise.iks-project.eu/ontology/Enhancement> ;
    <http://fise.iks-project.eu/ontology/selected-text>
        "Tagen"@de ;
    <http://fise.iks-project.eu/ontology/selection-context>
        "An Tagen wie diesen würde man lieber baden gehen!"@de ;
    <http://fise.iks-project.eu/ontology/start>
        "3"^^<http://www.w3.org/2001/XMLSchema#int> ;
    <http://fise.iks-project.eu/ontology/end>
        "8"^^<http://www.w3.org/2001/XMLSchema#int> ;
    <http://fise.iks-project.eu/ontology/hasLemmaForm>
        "tagen"@de , "Tag"@de ;
    <http://fise.iks-project.eu/ontology/hasMorphologicalFeature>
        "MOOD=SUB" , "MOOD=INF" , "POS=N" , "PERSON=P3" ,
        "CASE=DAT" , "POS=V" , "TENSE=PRS" , "GENDER=MAS" , "NUMBER=PLU" ;
This engine uses the properties
* fise:hasLemmaForm
* fise:hasMorphologicalFeature: values are {key}={value}
to encode the results of the morphological analysis. Note, however,
that these two properties are NOT specified in the Stanbol Enhancement
Structure.
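Since the fise:hasMorphologicalFeature values are plain {key}={value}
literals, a client has to split them itself. Here is a minimal sketch of
how that could look. The class MorphFeatureParser and its parse method
are hypothetical names of mine, not part of Stanbol or the CELI engine.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper (not part of Stanbol): groups "{key}={value}" literals by key. */
final class MorphFeatureParser {

    static Map<String, Set<String>> parse(Iterable<String> literals) {
        Map<String, Set<String>> features = new LinkedHashMap<>();
        for (String literal : literals) {
            int sep = literal.indexOf('=');
            if (sep <= 0 || sep == literal.length() - 1) {
                continue; // skip anything that is not of the form key=value
            }
            String key = literal.substring(0, sep);
            String value = literal.substring(sep + 1);
            // a key can occur several times, so collect all values per key
            features.computeIfAbsent(key, k -> new LinkedHashSet<>()).add(value);
        }
        return features;
    }

    public static void main(String[] args) {
        // literal values taken from the "Tagen" example
        List<String> literals = Arrays.asList("MOOD=SUB", "MOOD=INF", "POS=N",
                "PERSON=P3", "CASE=DAT", "POS=V", "TENSE=PRS", "GENDER=MAS",
                "NUMBER=PLU");
        System.out.println(parse(literals));
    }
}
```

Note that a key can map to several values: for "Tagen" the POS key
carries both N and V, because the features of all readings end up on the
same TextAnnotation.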
Doing the same with the POSTagger of OpenNLP would be quite easy,
especially if you use the
"org.apache.stanbol.commons.opennlp.TextAnalyzer" as the
KeywordLinkingEngine does:
@Reference
OpenNLP openNLP; //injected -> loads models from config
//get the plain text from the ContentItem
Entry<UriRef,Blob> contentPart = ContentItemHelper.getBlob(ci,
    Collections.singleton("text/plain"));
String text = ContentItemHelper.getText(contentPart.getValue());
//get the language of the text
String lang = EnhancementEngineHelper.getLanguage(ci);
//Analyze the text
//config for the TextAnalyzer ... you may expose some of them
//in the Engine config
TextAnalyzerConfig config = new TextAnalyzerConfig(); //uses defaults
//create the TextAnalyzer
TextAnalyzer analyzer = new TextAnalyzer(openNLP, lang, config);
//process the text
Iterator<AnalysedText> analysedSentences = analyzer.analyse(text);
while(analysedSentences.hasNext()){
    AnalysedText analysed = analysedSentences.next();
    //NOTE: depending on the config and the available models
    //      Tokens and/or Chunks might not be present
    //(written from memory: getTokens()/getChunks() on AnalysedText)
    for(Token token : analysed.getTokens()){
        String posTag = token.getPosTag();
        double posProb = token.getPosProbability();
    }
    for(Chunk chunk : analysed.getChunks()){
        //similar things for chunks
    }
}
While iterating over the sentences, tokens and chunks you could create
TextAnnotations similar to those created by the CELI engine.
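To illustrate which values such a per-token TextAnnotation would carry,
here is a rough, Stanbol-independent sketch that renders one token as
Turtle text. TokenAnnotationSketch, toTurtle and the id parameter are
hypothetical, and reusing fise:hasMorphologicalFeature with a "POS=..."
value is only my assumption, modelled on the CELI output. In an actual
engine you would add the triples to the metadata of the ContentItem
instead of building strings.

```java
/** Hypothetical, Stanbol-independent sketch: renders one token as Turtle text. */
final class TokenAnnotationSketch {
    static final String FISE = "http://fise.iks-project.eu/ontology/";
    static final String XSD_INT = "http://www.w3.org/2001/XMLSchema#int";

    /** Escapes backslashes, quotes and line breaks for a Turtle string literal. */
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"")
                .replace("\n", "\\n").replace("\r", "\\r");
    }

    /**
     * @param id     opaque identifier supplied by the caller (assumption)
     * @param token  the selected text of the token
     * @param start  start offset of the token within the analysed text
     * @param posTag POS tag as returned by the tagger
     * @param lang   language code of the text, e.g. "de"
     */
    static String toTurtle(String id, String token, int start,
                           String posTag, String lang) {
        int end = start + token.length(); // exclusive, as in the "Tagen" example (3..8)
        return "<urn:enhancement-" + id + ">\n"
            + "    a <" + FISE + "TextAnnotation> ;\n"
            + "    <" + FISE + "selected-text> \"" + escape(token) + "\"@" + lang + " ;\n"
            + "    <" + FISE + "start> \"" + start + "\"^^<" + XSD_INT + "> ;\n"
            + "    <" + FISE + "end> \"" + end + "\"^^<" + XSD_INT + "> ;\n"
            + "    <" + FISE + "hasMorphologicalFeature> \"POS=" + escape(posTag) + "\" .\n";
    }

    public static void main(String[] args) {
        System.out.print(toTurtle("example", "Tagen", 3, "N", "de"));
    }
}
```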
However, note that, as Olivier mentioned, this creates a lot of RDF
triples, so it will not scale to very long texts. Assume about 20
triples per word. Texts with a few thousand words should still be fine,
but if you analyze longer texts you will run into performance and
memory issues.
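To get a feeling for the numbers, here is a back-of-the-envelope sketch
based on the 20 triples per word assumed above. TripleEstimate is a
hypothetical helper, and the whitespace-based word count is only a crude
approximation of what a real tokenizer would produce.

```java
/** Hypothetical helper: rough estimate of the RDF triples created for a text. */
final class TripleEstimate {
    /** Rule of thumb from this mail; an assumption, not a measured value. */
    static final int TRIPLES_PER_WORD = 20;

    /** Crude word count: splits on whitespace only. */
    static int countWords(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    static long estimateTriples(long wordCount) {
        return wordCount * TRIPLES_PER_WORD;
    }

    public static void main(String[] args) {
        // a text with 5000 words
        System.out.println(estimateTriples(5000)); // prints 100000
        // a short ASCII sample with 4 words
        System.out.println(estimateTriples(countWords("An Tagen wie diesen"))); // prints 80
    }
}
```

So a text with 5000 words already ends up at around 100000 triples.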
best
Rupert
On Thu, Jun 21, 2012 at 2:52 PM, Olivier Grisel
<[email protected]> wrote:
> 2012/6/21 Andrea Taurchini <[email protected]>:
>> Dear Olivier,
>> thanks for your reply.
>> Ok, so it is possible, but I have to implement it as a new Engine on my own.
>> As for the "Tagging Server", it is a new RESTful interface to OpenNLP
>> that exposes its algorithms over HTTP.
>
> Alright, then you can indeed write a set of new low-level, pure NLP
> engines and delegate the semantic interpretation of such
> annotations to the caller.
>
> The Stanbol RDF-based output format might be a little bit verbose for
> this kind of low-level annotation though.
>
> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel
--
| Rupert Westenthaler [email protected]
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen