Hi Peter,

while I don't think I will be using the HtmlConverter right away, I would vote for using the offsets of the document annotation (0 to the document length) for annotations that relate to the whole document, such as metadata. That makes them show up nicely in the CasEditor/Viewer, and you could maintain them in all segments when you split a CAS (e.g. with something based on the SimpleTextSegmenter example).
-- Jens

On Sat, Mar 7, 2015 at 5:33 PM, Peter Klügl <pklu...@uni-wuerzburg.de> wrote:

> Hi,
>
> there is no way yet to customize this behavior. The HtmlConverter only
> retains annotations with a length > 0 since annotations with length == 0 are
> rather problematic and should be avoided.
>
> I can add a configuration parameter for keeping these annotations if you
> want (best open an issue for it). What should be the offsets of the
> annotations for elements in the head of the html document? 0, those of the
> first token or those of the document annotation?
>
> Best,
>
> Peter
>
>
> On 06.03.2015 at 14:00, Mario Gazzo wrote:
>
>> We conducted some experiments with both the HtmlAnnotator and the
>> HtmlConverter but we ran into an issue with the converter. It appears to
>> only convert tag annotations that surround or are inside the body tag.
>> Metadata elements like citations are ignored. The only way to get around
>> this seems to be by forking and modifying the codebase, which I would like
>> to avoid. Both modules seem otherwise very useful to us but I am looking
>> for a better approach to solve this issue. Is there some way to customise
>> this behaviour without code modifications?
>>
>> Your input is appreciated, thanks.
>>
>>
>> On 18 Feb 2015, at 23:03, Mario Gazzo <mario.ga...@gmail.com> wrote:
>>>
>>> Thanks. Looks interesting, seems that it could fit our use case. We will
>>> have a closer look at it.
>>>
>>> On 18 Feb 2015, at 21:58, Peter Klügl <pklu...@uni-wuerzburg.de> wrote:
>>>>
>>>> Hi,
>>>>
>>>> you might want to take a look at two analysis engines of UIMA Ruta:
>>>> HtmlAnnotator and HtmlConverter [1]
>>>>
>>>> The former one creates annotations for html elements and therefore also
>>>> for xml tags. The latter one creates a new view with only the plain text
>>>> and adds existing annotations while adapting their offsets to the new
>>>> document.
>>>>
>>>> Best,
>>>>
>>>> Peter
>>>>
>>>> [1] http://uima.apache.org/d/ruta-current/tools.ruta.book.html#ugr.tools.ruta.ae.html
>>>>
>>>> On 18.02.2015 at 21:46, Mario Gazzo wrote:
>>>>
>>>>> We are starting to use the UIMA framework for NL processing article
>>>>> text, which is usually stored with metadata in some XML format. We need
>>>>> to extract text elements to be processed by various NL analysis engines
>>>>> that only work with pure text but we also need to keep track of the
>>>>> formatting information related to the processed text. It is in general
>>>>> also valuable for us to be able to track every annotation back to the
>>>>> original XML to maintain provenance. Before embarking on this I would
>>>>> like to validate our approach with more experienced users since this is
>>>>> the first application we are building with UIMA.
>>>>>
>>>>> In the first step we would annotate every important element of the XML
>>>>> including formatting elements in the body. We maintain some DOM-like
>>>>> relationships between the body text and formatting annotations so that
>>>>> text formatting can be reproduced later with NLP annotations in some
>>>>> article viewer.
>>>>>
>>>>> Next we would in another AE produce a pure text view of the text
>>>>> annotations in the XML view that need to be NL analysed. In this new
>>>>> text view we would annotate the different text elements with references
>>>>> back to their counterpart in the original XML view so that we can trace
>>>>> back positions in the original XML and the formatting relations. This of
>>>>> course will require mapping NLP annotation offsets in the text view back
>>>>> to the XML view but the information should then be there to make this
>>>>> possible.
>>>>>
>>>>> This approach requires somewhat more handcrafted bookkeeping than we
>>>>> initially hoped would be necessary.
>>>>> We haven’t been able to find any examples of how this is usually done
>>>>> and the UIMA docs are vague regarding managing these kinds of
>>>>> relationships across views. We would therefore really like to know if
>>>>> there is a simpler and better approach.
>>>>>
>>>>> Any feedback is greatly appreciated. Thanks.
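
The offset bookkeeping discussed in this thread can be sketched in a few lines. The following is only an illustration in plain Python, not UIMA or Ruta code and not how the HtmlConverter is implemented. It strips tags with a naive regular expression, records for every character of the plain text where it came from in the markup, and uses that table to move annotation spans between the two views. All names here (`build_text_view`, `to_text_view`, `to_source_span`, `empty_policy`) are hypothetical and were made up for this sketch.

```python
import re
from bisect import bisect_left

# Naive tag matcher: an assumption for this sketch, not a real HTML/XML parser.
TAG = re.compile(r"<[^>]*>")


def build_text_view(source):
    """Strip tags and return (text, src_offsets).

    src_offsets[i] is the offset in `source` of the i-th character of `text`.
    """
    chars, src_offsets = [], []
    pos = 0
    for match in TAG.finditer(source):
        for i in range(pos, match.start()):
            chars.append(source[i])
            src_offsets.append(i)
        pos = match.end()
    for i in range(pos, len(source)):
        chars.append(source[i])
        src_offsets.append(i)
    return "".join(chars), src_offsets


def to_text_view(annotations, src_offsets, empty_policy="drop"):
    """Map (begin, end, label) spans from source offsets to text offsets.

    empty_policy decides what happens to spans that cover no text:
    "drop" discards them, any other value stretches them over the whole text.
    """
    text_length = len(src_offsets)
    converted = []
    for begin, end, label in annotations:
        # Count the retained characters that lie before each boundary.
        new_begin = bisect_left(src_offsets, begin)
        new_end = bisect_left(src_offsets, end)
        if new_begin == new_end:
            if empty_policy == "drop":
                continue
            new_begin, new_end = 0, text_length
        converted.append((new_begin, new_end, label))
    return converted


def to_source_span(begin, end, src_offsets):
    """Map a non-empty text span back to the tightest covering source span."""
    if not 0 <= begin < end <= len(src_offsets):
        raise ValueError("expected a non-empty span inside the text view")
    return src_offsets[begin], src_offsets[end - 1] + 1


# Made-up sample document with one metadata element in the head.
source = ('<html><head><meta name="citation_author" content="Doe"/></head>'
          '<body><p>Hello <b>world</b></p></body></html>')
meta = (source.index("<meta"), source.index("/>") + 2, "meta")
bold = (source.index("<b>"), source.index("</b>") + len("</b>"), "b")

text, src_offsets = build_text_view(source)
print(text)
print(to_text_view([meta, bold], src_offsets))
print(to_text_view([meta, bold], src_offsets, empty_policy="document"))
```

In this sketch, `empty_policy="drop"` mirrors the behaviour Peter describes, where spans that end up with length 0 are not retained, while `empty_policy="document"` follows Jens's suggestion of giving head metadata the offsets of the whole document. `to_source_span` covers the way back, from an NLP annotation in the text view to its position in the original markup.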