We are starting to use the UIMA framework for natural-language (NL) processing
of article text, which is usually stored together with its metadata in some XML
format. We need to extract the text elements so they can be processed by various
NL analysis engines that only work on plain text, but we also need to keep track
of the formatting information attached to the processed text. In general it is
also valuable for us to be able to trace every annotation back to the original
XML to maintain provenance. Before embarking on this I would like to validate
our approach with more experienced users, since this is the first application we
are building with UIMA.

In the first step we would annotate every important element of the XML,
including the formatting elements in the body. We would maintain some DOM-like
relationships between the body text and the formatting annotations so that the
text formatting can later be reproduced, together with the NLP annotations, in
an article viewer.
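
To make the first step concrete, here is a minimal sketch of the bookkeeping we
have in mind, written as plain Java with no UIMA dependency. The class and field
names (ElementSpanSketch, Element, scan) are hypothetical; in a real AE each
Element would presumably become an annotation in the XML view with a feature
pointing at its parent. It assumes well-formed XML without comments or CDATA
sections, and is only meant to illustrate the idea.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: none of these names come from UIMA.
final class ElementSpanSketch {

    /** One XML element: its span in the XML source and its enclosing element. */
    static final class Element {
        final String name;
        final int begin;       // offset of the '<' that opens the start tag
        int end = -1;          // offset just past the '>' that closes the element
        final Element parent;  // null for the root element

        Element(String name, int begin, Element parent) {
            this.name = name;
            this.begin = begin;
            this.parent = parent;
        }
    }

    // Matches start tags, end tags and self-closing tags.
    // Assumption: no comments, no CDATA, no '>' inside attribute values.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][\\w:.-]*)[^>]*?(/?)>");

    /** Returns all elements in document order, each linked to its parent. */
    static List<Element> scan(String xml) {
        List<Element> elements = new ArrayList<>();
        Deque<Element> open = new ArrayDeque<>(); // stack of unclosed elements
        Matcher m = TAG.matcher(xml);
        while (m.find()) {
            boolean closing = !m.group(1).isEmpty();
            boolean selfClosing = !m.group(3).isEmpty();
            if (closing) {
                // Assumes well-formed input, so the top of the stack matches.
                open.pop().end = m.end();
            } else {
                Element e = new Element(m.group(2), m.start(), open.peek());
                elements.add(e);
                if (selfClosing) {
                    e.end = m.end();
                } else {
                    open.push(e);
                }
            }
        }
        return elements;
    }

    public static void main(String[] args) {
        for (Element e : scan("<p>Hello <b>big</b> world</p>")) {
            System.out.println(e.name + " [" + e.begin + "," + e.end + ") parent="
                    + (e.parent == null ? "none" : e.parent.name));
        }
    }
}
```

The parent link is the only DOM-like relation kept here; sibling order falls out
of the document order of the returned list.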

Next, in a separate analysis engine (AE), we would produce a plain-text view
containing the text annotations from the XML view that need NL analysis. In
this new text view we would annotate each text element with a reference back to
its counterpart in the original XML view, so that we can trace positions back
to the original XML and to the formatting relations. This will of course require
mapping NLP annotation offsets in the text view back to the XML view, but the
information needed to do so should then be available.
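
As an illustration of the offset mapping, the sketch below builds the plain text
from an XML string while recording, for every character of the text, where it
came from in the XML. Again this is plain Java with hypothetical names
(OffsetMapSketch, toXmlSpan) and no UIMA dependency. It assumes the XML contains
no entities, comments or CDATA sections, so every text character corresponds to
exactly one XML character.

```java
import java.util.Arrays;

// Hypothetical sketch: none of these names come from UIMA.
final class OffsetMapSketch {
    final String text;      // the plain-text view
    final int[] xmlOffset;  // xmlOffset[i] = position in the XML of text.charAt(i)

    OffsetMapSketch(String xml) {
        StringBuilder sb = new StringBuilder();
        int[] map = new int[xml.length()];
        boolean inTag = false;
        for (int i = 0; i < xml.length(); i++) {
            char c = xml.charAt(i);
            if (c == '<') {
                inTag = true;               // start of a tag: skip it
            } else if (c == '>' && inTag) {
                inTag = false;              // end of the tag
            } else if (!inTag) {
                map[sb.length()] = i;       // remember where this character came from
                sb.append(c);
            }
        }
        this.text = sb.toString();
        this.xmlOffset = Arrays.copyOf(map, sb.length());
    }

    /**
     * Maps the half-open text span [begin, end) to the smallest half-open
     * XML span that covers the same characters.
     */
    int[] toXmlSpan(int begin, int end) {
        if (begin < 0 || end > text.length() || begin >= end) {
            throw new IllegalArgumentException("invalid span " + begin + "," + end);
        }
        return new int[] { xmlOffset[begin], xmlOffset[end - 1] + 1 };
    }

    public static void main(String[] args) {
        String xml = "<p>Hello <b>big</b> world</p>";
        OffsetMapSketch view = new OffsetMapSketch(xml);
        int begin = view.text.indexOf("big");
        int[] span = view.toXmlSpan(begin, begin + "big".length());
        System.out.println(view.text);
        System.out.println(xml.substring(span[0], span[1]));
    }
}
```

Note that a text span crossing an element boundary maps to an XML span that
includes the intervening tags, so the result is a covering span rather than a
list of separate text fragments.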

This approach requires somewhat more handcrafted bookkeeping than we initially
hoped would be necessary. We haven’t been able to find any examples of how this
is usually done, and the UIMA docs are vague about managing this kind of
relationship across views. We would therefore really like to know whether there
is a simpler and better approach.

Any feedback is greatly appreciated. Thanks.
