Hi all, 

Recently, while working on a post-processing engine, I realized that it is
currently not straightforward to deal with the data produced by Linking
engines. Basically, in my opinion, the problem is that there is currently no
easy way to relate the results of NLP analysis to the results of the Linking
process. After NLP analysis, all the extracted Spans (tokens, sentences, chunks 
and so on) are stored in an AnalyzedText object [1]. This model has a nice,
easy-to-use API and it really eases the work of the subsequent engines in a chain.
However, the results of the Linking Engines are currently stored only in the
Clerezza graph holding the metadata of a ContentItem, mainly as Text and Entity
Annotations. Although there are some helpers for dealing with the annotations
within the graph, when developing a, let’s say, post-linking engine, a
developer really misses a way to find, for example, the text and entity
annotations that could be associated with the spans. The only way I have found
so far, without starting to work on a proper solution, has been to locate the
spans associated with a Text Annotation by using the start and end offsets.
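
To make the offset-based workaround concrete, here is a minimal sketch in
plain Java. It is only an illustration under stated assumptions: the Span and
TextAnnotation classes below are hypothetical stand-ins, not the Stanbol
AnalyzedText or Clerezza APIs, and offsets are assumed to be character
positions with an inclusive start and an exclusive end.

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetSpanLookup {

    /** Hypothetical stand-in for a span (token, chunk, sentence) from the NLP results. */
    public static final class Span {
        public final String type;
        public final int start; // inclusive (assumption)
        public final int end;   // exclusive (assumption)

        public Span(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    /** Hypothetical stand-in for a Text Annotation read from the metadata graph. */
    public static final class TextAnnotation {
        public final String selectedText;
        public final int start; // inclusive (assumption)
        public final int end;   // exclusive (assumption)

        public TextAnnotation(String selectedText, int start, int end) {
            this.selectedText = selectedText;
            this.start = start;
            this.end = end;
        }
    }

    /** Returns the spans that lie entirely within the annotation's offset range. */
    public static List<Span> spansCoveredBy(TextAnnotation annotation, List<Span> spans) {
        List<Span> covered = new ArrayList<>();
        for (Span span : spans) {
            if (span.start >= annotation.start && span.end <= annotation.end) {
                covered.add(span);
            }
        }
        return covered;
    }

    public static void main(String[] args) {
        // Offsets refer to the text "Paris is the capital of France."
        List<Span> spans = new ArrayList<>();
        spans.add(new Span("Sentence", 0, 31));
        spans.add(new Span("Token", 0, 5));   // Paris
        spans.add(new Span("Token", 24, 30)); // France
        spans.add(new Span("Token", 30, 31)); // .

        TextAnnotation annotation = new TextAnnotation("France", 24, 30);
        for (Span span : spansCoveredBy(annotation, spans)) {
            System.out.println(span.type + " [" + span.start + ", " + span.end + ")");
        }
    }
}
```

With this sample data only the "France" token is returned. The sketch scans
every span for each annotation, which hints at why a direct association
between spans and annotations would be preferable.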

I would like to start a discussion here about the best design for tackling this 
problem.

Cheers,
Rafa

[1] - https://stanbol.apache.org/docs/trunk/components/enhancer/nlp/analyzedtext

