Hi all,

Recently, while working on a post-processing engine, I realized that it is currently not straightforward to deal with the data produced by Linking engines. In my opinion, the problem is that there is currently no easy way to relate the results of NLP analysis to the results of the Linking process.

After NLP analysis, all the extracted Spans (tokens, sentences, chunks and so on) are stored in an AnalyzedText object [1]. This model has a nice API that really eases the work of the subsequent engines in a chain. However, the results of the Linking engines are currently stored only in the Clerezza graph holding the metadata of a ContentItem, mainly as Text and Entity Annotations. Although there are some helpers for dealing with the annotations in the graph, a developer writing, let's say, a post-linking engine really misses a way to find, for example, the text and entity annotations associated with a given span. The only way I have found so far, without starting to work on a proper solution, is to locate the spans associated with a Text Annotation by using its start and end offsets.
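
For illustration, the offset-based workaround could be sketched roughly as follows. This is only a minimal, self-contained sketch and does not use the real Stanbol classes: `Span`, `TextAnnotation` and `spansCoveredBy` are hypothetical stand-ins that carry only character offsets, and it assumes end offsets are exclusive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OffsetMatcher {

    /** Hypothetical stand-in for an NLP span (token, chunk, sentence). */
    static final class Span {
        final int start;   // inclusive character offset
        final int end;     // exclusive character offset (assumption)
        final String text;

        Span(int start, int end, String text) {
            this.start = start;
            this.end = end;
            this.text = text;
        }
    }

    /** Hypothetical stand-in for a Text Annotation read from the metadata graph. */
    static final class TextAnnotation {
        final int start;
        final int end;
        final String selectedText;

        TextAnnotation(int start, int end, String selectedText) {
            this.start = start;
            this.end = end;
            this.selectedText = selectedText;
        }
    }

    /** Returns the spans whose offsets lie entirely inside the annotation's range. */
    static List<Span> spansCoveredBy(TextAnnotation annotation, List<Span> spans) {
        List<Span> covered = new ArrayList<>();
        for (Span span : spans) {
            if (span.start >= annotation.start && span.end <= annotation.end) {
                covered.add(span);
            }
        }
        return covered;
    }

    public static void main(String[] args) {
        // Offsets refer to the text "Paris is the capital of France."
        List<Span> spans = Arrays.asList(
                new Span(0, 31, "Paris is the capital of France."), // sentence
                new Span(0, 5, "Paris"),
                new Span(13, 20, "capital"),
                new Span(24, 30, "France"));

        TextAnnotation annotation = new TextAnnotation(24, 30, "France");

        for (Span span : spansCoveredBy(annotation, spans)) {
            System.out.println(span.text + " [" + span.start + ", " + span.end + ")");
        }
    }
}
```

A linear scan like this has to be repeated for every annotation, and the reverse lookup (from a span to its annotations) needs yet another pass over the graph, which is part of why I think a proper solution is needed.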
I would like to start a discussion here about the best design for tackling this problem.

Cheers,
Rafa

[1] - https://stanbol.apache.org/docs/trunk/components/enhancer/nlp/analyzedtext