The issue has now been created: https://issues.apache.org/jira/browse/UIMA-4286
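
To make the requested behaviour concrete, here is a rough sketch of the separator idea from the quoted message below. It is plain Python on top of the standard library, not Ruta's HtmlConverter code; `convert`, `EXAMPLE` and the ". " separator are names and choices assumed for illustration only. The sketch ignores tail text (mixed content) and normalises whitespace, so its offsets refer to the converted text only.

```python
import xml.etree.ElementTree as ET


def convert(xml_source, separator=". "):
    """Flatten element text into one string and record a span per element.

    Returns (plain_text, spans), where spans is a list of (tag, begin, end)
    in offsets of the converted text. Separators are inserted between
    text-bearing elements and are never part of a leaf element's span.
    Assumption: tail text after a closing tag is ignored in this sketch.
    """
    root = ET.fromstring(xml_source)
    parts = []   # pieces of the converted text, in document order
    spans = []   # (tag, begin, end), appended in post-order
    length = 0   # current length of the converted text

    def visit(element):
        nonlocal length
        begin = end = None
        text = " ".join((element.text or "").split())  # normalise whitespace
        if text:
            if parts:  # separate from the previous content element
                parts.append(separator)
                length += len(separator)
            begin = length
            parts.append(text)
            length += len(text)
            end = length
        for child in element:
            child_span = visit(child)
            if child_span is not None:
                if begin is None:
                    begin = child_span[0]
                end = child_span[1]
        if begin is None:
            return None  # no content in scope; the caller decides on a fallback
        spans.append((element.tag, begin, end))
        return begin, end

    visit(root)
    return "".join(parts), spans


EXAMPLE = """<journal-meta>
  <journal-id journal-id-type="nlm-ta">Environ Health Perspect</journal-id>
  <journal-title>Environmental Health Perspectives</journal-title>
  <issn pub-type="ppub">0091-6765</issn>
  <publisher>
    <publisher-name>National Institute of Environmental Health
      Sciences</publisher-name>
  </publisher>
</journal-meta>"""

text, spans = convert(EXAMPLE)
by_tag = {tag: (begin, end) for tag, begin, end in spans}
```

A leaf such as journal-title gets a span that covers only its own text, while journal-meta spans everything in its scope and therefore necessarily includes the separators between its children. Elements without any content in scope get no span here; which fallback they should receive is exactly the open question in the thread.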
> On 11 Mar 2015, at 14:47, Mario Gazzo <mario.ga...@gmail.com> wrote:
>
> Thanks, I understand the choices now. I would also probably prefer to use
> the document annotation if no text content is associated with the tag.
> However, ideally I would prefer that tag annotations get the offsets of the
> content that is within their scope and otherwise get the offsets of the
> content within their closest shared ancestor element. Ultimately this could
> end up being the document annotation. E.g.
>
>   <journal-meta>
>     <journal-id journal-id-type="nlm-ta">Environ Health Perspect</journal-id>
>     <journal-title>Environmental Health Perspectives</journal-title>
>     <issn pub-type="ppub">0091-6765</issn>
>     <publisher>
>       <publisher-name>National Institute of Environmental Health
>         Sciences</publisher-name>
>     </publisher>
>   </journal-meta>
>
> Here I would expect journal-meta to have the offsets of all content within
> its scope, which in the converted view of my experiments gets combined into
> “Environ Health PerspectEnvironmental Health Perspectives0091-6765National
> Institute of Environmental Health Sciences”. This works as expected when I
> just disable the “inBody” flag of the HtmlConverterVisitor, except that
> there is no longer any clear separation between the content elements. That
> is why I would like to have a sentence separator like “. ” between them, so
> that I instead get “Environ Health Perspect. Environmental Health
> Perspectives. 0091-6765. National Institute of Environmental Health
> Sciences.”. The dot separators should then of course not be included in the
> converter’s offsets, since they are not part of the original text.
>
> Additionally, there might be a case where a meta tag doesn’t have any
> content within its scope but contains attribute values:
>
>   <Parent>
>     <Child1 attribute="someValue" />
>     <Child2>Some content.</Child2>
>   </Parent>
>
> In this case I would prefer that Child1 has the same offsets as Child2,
> since the tag is most closely related to that content. If there is no
> content within the scope of its parent, then I would find the first
> ancestor that contains content within its scope and use its offsets,
> although this choice is questionable. I don’t have a good example of this
> case, though, so I presume such cases are rare in practice.
>
> That said, the latter is more complicated to implement, so I would be happy
> if I could just easily turn off the “inBody” test in the
> HtmlConverterVisitor and have some way to add content separation between
> tags outside the body without resorting to code modifications.
>
> Hope this feedback was helpful.
>
> Your time is much appreciated, thanks.
>
>
>> On 09 Mar 2015, at 16:56, Jens Grivolla <j+...@grivolla.net> wrote:
>>
>> Hi Peter, while I don't think I will be using the HtmlConverter right away,
>> I would vote for using the length of the document annotation for
>> annotations that relate to the whole document (such as metadata). That
>> makes them show up nicely in the CasEditor/Viewer, and you could maintain
>> them in all segments when you split a CAS (e.g. with something based on the
>> SimpleTextSegmenter example).
>>
>> -- Jens
>>
>> On Sat, Mar 7, 2015 at 5:33 PM, Peter Klügl <pklu...@uni-wuerzburg.de>
>> wrote:
>>
>>> Hi,
>>>
>>> there is no way yet to customize this behavior. The HtmlConverter only
>>> retains annotations with a length > 0, since annotations with length == 0
>>> are rather problematic and should be avoided.
>>>
>>> I can add a configuration parameter for keeping these annotations if you
>>> want (best open an issue for it).
>>> What should be the offsets of the annotations for elements in the head
>>> of the HTML document? 0, those of the first token, or those of the
>>> document annotation?
>>>
>>> Best,
>>>
>>> Peter
>>>
>>>
>>> On 06.03.2015 at 14:00, Mario Gazzo wrote:
>>>
>>>> We conducted some experiments with both the HtmlAnnotator and the
>>>> HtmlConverter, but we ran into an issue with the converter. It appears
>>>> to only convert tag annotations that surround or are inside the body
>>>> tag. Metadata elements like citations are ignored. The only way around
>>>> this seems to be forking and modifying the codebase, which I would like
>>>> to avoid. Both modules otherwise seem very useful to us, but I am
>>>> looking for a better approach to solve this issue. Is there some way to
>>>> customise this behaviour without code modifications?
>>>>
>>>> Your input is appreciated, thanks.
>>>>
>>>>
>>>> On 18 Feb 2015, at 23:03, Mario Gazzo <mario.ga...@gmail.com> wrote:
>>>>>
>>>>> Thanks. Looks interesting, seems that it could fit our use case. We will
>>>>> have a closer look at it.
>>>>>
>>>>> On 18 Feb 2015, at 21:58, Peter Klügl <pklu...@uni-wuerzburg.de>
>>>>> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> you might want to take a look at two analysis engines of UIMA Ruta:
>>>>>> HtmlAnnotator and HtmlConverter [1].
>>>>>>
>>>>>> The former creates annotations for HTML elements and therefore also
>>>>>> for XML tags. The latter creates a new view with only the plain text
>>>>>> and adds existing annotations while adapting their offsets to the new
>>>>>> document.
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Peter
>>>>>>
>>>>>> [1] http://uima.apache.org/d/ruta-current/tools.ruta.book.html#ugr.tools.ruta.ae.html
>>>>>>
>>>>>> On 18.02.2015 at 21:46, Mario Gazzo wrote:
>>>>>>
>>>>>>> We are starting to use the UIMA framework for NL processing of article
>>>>>>> text, which is usually stored with metadata in some XML format.
>>>>>>> We need to extract text elements to be processed by various NL
>>>>>>> analysis engines that only work with pure text, but we also need to
>>>>>>> keep track of the formatting information related to the processed
>>>>>>> text. In general it is also valuable for us to be able to track every
>>>>>>> annotation back to the original XML to maintain provenance. Before
>>>>>>> embarking on this, I would like to validate our approach with more
>>>>>>> experienced users, since this is the first application we are
>>>>>>> building with UIMA.
>>>>>>>
>>>>>>> In the first step we would annotate every important element of the
>>>>>>> XML, including formatting elements in the body. We maintain some
>>>>>>> DOM-like relationships between the body text and formatting
>>>>>>> annotations so that text formatting can be reproduced later, together
>>>>>>> with NLP annotations, in some article viewer.
>>>>>>>
>>>>>>> Next, in another AE, we would produce a pure text view of the text
>>>>>>> annotations in the XML view that need to be NL analysed. In this new
>>>>>>> text view we would annotate the different text elements with
>>>>>>> references back to their counterparts in the original XML view, so
>>>>>>> that we can trace back positions in the original XML and the
>>>>>>> formatting relations. This will of course require mapping NLP
>>>>>>> annotation offsets in the text view back to the XML view, but the
>>>>>>> information should then be there to make this possible.
>>>>>>>
>>>>>>> This approach requires somewhat more handcrafted bookkeeping than we
>>>>>>> initially hoped would be necessary. We haven’t been able to find any
>>>>>>> examples of how this is usually done, and the UIMA docs are vague
>>>>>>> about managing these kinds of relationships across views. We would
>>>>>>> therefore really like to know if there is a simpler and better
>>>>>>> approach.
>>>>>>>
>>>>>>> Any feedback is greatly appreciated. Thanks.
>>>>>>>
>>>>>>
>>>
>
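
The mapping from text-view offsets back to the XML view that the original message describes can be illustrated with a small sketch. This is a minimal Python illustration and not UIMA or Ruta code; `strip_tags`, `to_source_span` and the naive `TAG` pattern are hypothetical helpers, and the sketch assumes markup without comments, CDATA sections, entities or ">" inside attribute values.

```python
import re

# Naive tag matcher (assumption: no '>' inside attribute values, no comments
# or CDATA sections). A real converter would use a proper parser.
TAG = re.compile(r"<[^>]*>")


def strip_tags(source):
    """Return (plain_text, offsets) for a marked-up string.

    offsets[i] is the index in `source` of the character plain_text[i],
    which is the bookkeeping needed to map text-view spans back later.
    """
    chars = []
    offsets = []
    position = 0
    for match in TAG.finditer(source):
        # Copy the text between the previous tag and this one.
        for index in range(position, match.start()):
            chars.append(source[index])
            offsets.append(index)
        position = match.end()
    # Copy whatever follows the last tag.
    for index in range(position, len(source)):
        chars.append(source[index])
        offsets.append(index)
    return "".join(chars), offsets


def to_source_span(offsets, begin, end):
    """Map a half-open span [begin, end) of the plain text to source offsets."""
    if not 0 <= begin < end <= len(offsets):
        raise ValueError("span must be non-empty and inside the plain text")
    return offsets[begin], offsets[end - 1] + 1


SOURCE = "<p>Some <b>bold</b> text.</p>"
plain, offsets = strip_tags(SOURCE)

# An annotation found on the plain text ("bold") mapped back to the source.
begin, end = to_source_span(offsets, plain.index("bold"), plain.index("bold") + 4)
```

Keeping one source offset per converted character is the simplest form of this bookkeeping. A span that crosses markup in the source maps to a source range that includes the tags, which is one reason zero-length annotations are awkward: they have no character to anchor to.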