Re: Approach for keeping track of formatting associated with text views

Mario Gazzo Fri, 13 Mar 2015 15:59:07 -0700

The issue has now been created:

https://issues.apache.org/jira/browse/UIMA-4286



> On 11 Mar 2015, at 14:47 , Mario Gazzo <mario.ga...@gmail.com> wrote:
> 
> Thanks, I understand the choices now. I would probably also prefer to use the 
> document annotation if no text content is associated with the tag. Ideally, 
> however, tag annotations would get the offsets of the content that is within 
> their scope, and otherwise the offsets of the content within their closest 
> shared ancestor element. Ultimately this could end up being the document 
> annotation. E.g.
> 
> <journal-meta>
>    <journal-id journal-id-type="nlm-ta">Environ Health Perspect</journal-id>
>    <journal-title>Environmental Health Perspectives</journal-title>
>    <issn pub-type="ppub">0091-6765</issn>
>    <publisher>
>        <publisher-name>National Institute of Environmental Health Sciences</publisher-name>
>    </publisher>
> </journal-meta>
> 
> Here I would expect journal-meta to have the offsets of all content within 
> its scope, which in the converted view of my experiments gets combined into: 
> “Environ Health PerspectEnvironmental Health Perspectives0091-6765National 
> Institute of Environmental Health Sciences”. This works as expected when I 
> just disable the “inBody” flag of the HtmlConverterVisitor, except that there 
> is no longer any clear separation between the content elements. That is why I 
> would like a sentence separator such as “. ” between them, so that I instead 
> get: “Environ Health Perspect. Environmental Health Perspectives. 0091-6765. 
> National Institute of Environmental Health Sciences.”. The dot separators 
> should of course not be included in the converter’s offsets, since they are 
> not part of the original text.
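
The separator idea above can be sketched roughly as follows. This is a minimal Python illustration, not the HtmlConverter's behaviour: the function name `flatten_with_separators`, the whitespace normalisation, and the choice to ignore tail text are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Sample taken from the message above (attributes kept, wrapping removed).
SAMPLE = """
<journal-meta>
   <journal-id journal-id-type="nlm-ta">Environ Health Perspect</journal-id>
   <journal-title>Environmental Health Perspectives</journal-title>
   <issn pub-type="ppub">0091-6765</issn>
   <publisher>
       <publisher-name>National Institute of Environmental Health Sciences</publisher-name>
   </publisher>
</journal-meta>
"""


def flatten_with_separators(xml_source, separator=". "):
    """Hypothetical helper: return (text, spans).

    text  - element contents joined with `separator`
    spans - list of (tag, begin, end) offsets into `text`
    A separator is written *before* a span begins, so it never falls inside
    the span of the element whose text it precedes.
    """
    root = ET.fromstring(xml_source)
    parts = []    # pieces of the output text, in document order
    spans = []    # (tag, begin, end)
    length = 0    # current length of the output text

    def visit(element):
        nonlocal length
        begin = end = None
        # Assumption: collapse whitespace; tail text is ignored in this sketch.
        content = " ".join((element.text or "").split())
        if content:
            if parts:
                parts.append(separator)
                length += len(separator)
            begin = length
            parts.append(content)
            length += len(content)
            end = length
        for child in element:
            child_begin, child_end = visit(child)
            if child_begin is not None:
                if begin is None:
                    begin = child_begin
                end = child_end
        if begin is not None:
            spans.append((element.tag, begin, end))
        return begin, end

    visit(root)
    return "".join(parts), spans


text, spans = flatten_with_separators(SAMPLE)
for tag, begin, end in spans:
    print(tag, begin, end, repr(text[begin:end]))
```

Note that the span of an enclosing element such as journal-meta necessarily runs across the separators between its children; only the spans of the individual content elements are separator-free. The sketch also puts separators only between elements, so it does not produce the trailing dot shown in the example above.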
> 
> Additionally, there might be a case where a meta tag doesn’t have any content 
> within its scope but does contain attribute values:
> 
> <Parent>
>       <Child1 attribute="someValue" />
>       <Child2>Some content.</Child2>
> </Parent>
> 
> In this case I would prefer that Child1 gets the same offsets as Child2, since 
> the tag is most closely related to that content. If there is no content 
> within the scope of its parent, I would find the first ancestor that 
> contains content within its scope and use its offsets, although this choice is 
> questionable. I don’t have a good example of this case, though, so I presume 
> such cases are rare in practice.
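
One way to read the fallback rule above is: an element without content borrows the span of its nearest ancestor that does have content, and ultimately the whole document. The following Python sketch illustrates that reading only; `assign_spans` is a hypothetical name, and the rule is one interpretation of the message, not existing converter behaviour.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<Parent>
      <Child1 attribute="someValue" />
      <Child2>Some content.</Child2>
</Parent>
"""


def assign_spans(xml_source):
    """Hypothetical helper: return (text, spans) with spans as (tag, begin, end).

    Elements with content get the span of that content. Elements without
    content inherit the span of the closest ancestor that has content, or
    the whole document (0, len(text)) if no ancestor has any.
    """
    root = ET.fromstring(xml_source)
    parts = []
    length = 0
    own = {}  # element -> (begin, end), or None when it has no content
    parent_of = {child: parent for parent in root.iter() for child in parent}

    def measure(element):
        nonlocal length
        begin = end = None
        content = " ".join((element.text or "").split())
        if content:
            begin = length
            parts.append(content)
            length += len(content)
            end = length
        for child in element:
            child_span = measure(child)
            if child_span is not None:
                if begin is None:
                    begin = child_span[0]
                end = child_span[1]
        own[element] = None if begin is None else (begin, end)
        return own[element]

    measure(root)

    def resolve(element):
        # Walk upwards until an element with content is found.
        node = element
        while node is not None and own[node] is None:
            node = parent_of.get(node)
        return own[node] if node is not None else (0, length)

    spans = [(element.tag, *resolve(element)) for element in root.iter()]
    return "".join(parts), spans


text, spans = assign_spans(SAMPLE)
for tag, begin, end in spans:
    print(tag, begin, end, repr(text[begin:end]))
```

In this example Parent's only content is that of Child2, so Child1 ends up with the same offsets as Child2, which matches the preference stated above. With several content-bearing siblings the two rules would differ, and Child1 would get the span of the whole Parent.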
> 
> That said, the latter is more complicated to implement, so I would be happy 
> if I could just easily turn off the “inBody”-test in the HtmlConverterVisitor 
> and have some way to add content separation between tags outside body without 
> resorting to code modifications.
> 
> Hope this feedback was helpful.
> 
> Your time is much appreciated, thanks.
> 
> 
>> On 09 Mar 2015, at 16:56 , Jens Grivolla <j+...@grivolla.net> wrote:
>> 
>> Hi Peter, while I don't think I will be using the HtmlConverter right away,
>> I would vote for using the length of the document annotation for
>> annotations that relate to the whole document (such as metadata).  That
>> makes them show up nicely in the CasEditor/Viewer and you could maintain it
>> in all segments when you split a CAS (e.g. with something based on the
>> SimpleTextSegmenter example).
>> 
>> -- Jens
>> 
>> On Sat, Mar 7, 2015 at 5:33 PM, Peter Klügl <pklu...@uni-wuerzburg.de>
>> wrote:
>> 
>>> Hi,
>>> 
>>> there is no way yet to customize this behavior. The HtmlConverter only
>>> retains annotations with a length > 0, since annotations with length == 0 are
>>> rather problematic and should be avoided.
>>> 
>>> I can add a configuration parameter for keeping these annotations if you
>>> want (best to open an issue for it). What should the offsets of the
>>> annotations for elements in the head of the HTML document be? 0, those of the
>>> first token, or those of the document annotation?
>>> 
>>> Best,
>>> 
>>> Peter
>>> 
>>> 
>>> On 06.03.2015 at 14:00, Mario Gazzo wrote:
>>> 
>>>> We conducted some experiments with both the HtmlAnnotator and the
>>>> HtmlConverter but we ran into an issue with the converter. It appears to
>>>> only convert tag annotations that surround or are inside the body tag.
>>>> Metadata elements like citations are ignored. The only way to get around
>>>> this seems to be forking and modifying the codebase, which I would like to
>>>> avoid. Both modules otherwise seem very useful to us, but I am looking for a
>>>> better approach to solve this issue. Is there some way to customise this
>>>> behaviour without code modifications?
>>>> 
>>>> Your input is appreciated, thanks.
>>>> 
>>>> 
>>>> On 18 Feb 2015, at 23:03 , Mario Gazzo <mario.ga...@gmail.com> wrote:
>>>>> 
>>>>> Thanks. Looks interesting, seems that it could fit our use case. We will
>>>>> have a closer look at it.
>>>>> 
>>>>> On 18 Feb 2015, at 21:58 , Peter Klügl <pklu...@uni-wuerzburg.de>
>>>>>> wrote:
>>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> you might want to take a look at two analysis engines of UIMA Ruta:
>>>>>> HtmlAnnotator and HtmlConverter [1]
>>>>>> 
>>>>>> The former creates annotations for HTML elements and therefore also
>>>>>> for XML tags. The latter creates a new view with only the plain text
>>>>>> and adds existing annotations while adapting their offsets to the new
>>>>>> document.
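
The general idea of adapting offsets to a plain-text view can be illustrated with a small Python sketch. It is not the HtmlConverter implementation: `strip_markup` and `convert_annotations` are hypothetical names, and the tag scanner is deliberately naive (no entities, comments, CDATA, or ">" inside attribute values).

```python
def strip_markup(source):
    """Return (plain_text, offset_map).

    offset_map[i] is the plain-text offset that corresponds to source
    offset i; it has len(source) + 1 entries so end offsets can be mapped.
    """
    plain = []
    offset_map = []
    in_tag = False
    for ch in source:
        offset_map.append(len(plain))
        if ch == "<":
            in_tag = True
        elif ch == ">" and in_tag:
            in_tag = False
        elif not in_tag:
            plain.append(ch)
    offset_map.append(len(plain))
    return "".join(plain), offset_map


def convert_annotations(annotations, offset_map):
    """Map (name, begin, end) source annotations into the plain-text view.

    Annotations that collapse to length 0 are dropped, mirroring the
    length > 0 rule mentioned elsewhere in this thread.
    """
    converted = []
    for name, begin, end in annotations:
        new_begin, new_end = offset_map[begin], offset_map[end]
        if new_end > new_begin:
            converted.append((name, new_begin, new_end))
    return converted


source = '<p>Some <b>bold</b> text</p><meta name="x"/>'
# Element annotations over the source, as an annotator might produce them.
annotations = [("p", 0, 28), ("b", 8, 19), ("meta", 28, 44)]

plain, offset_map = strip_markup(source)
converted = convert_annotations(annotations, offset_map)
for name, begin, end in converted:
    print(name, begin, end, repr(plain[begin:end]))
```

The empty meta element maps to a zero-length span and is therefore dropped, which is the behaviour questioned later in the thread for elements without text content.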
>>>>>> 
>>>>>> Best,
>>>>>> 
>>>>>> Peter
>>>>>> 
>>>>>> [1] http://uima.apache.org/d/ruta-current/tools.ruta.book.html#ugr.tools.ruta.ae.html
>>>>>> 
>>>>>> On 18.02.2015 at 21:46, Mario Gazzo wrote:
>>>>>> 
>>>>>>> We are starting to use the UIMA framework for NL processing of article
>>>>>>> text, which is usually stored with metadata in some XML format. We need
>>>>>>> to extract text elements to be processed by various NL analysis engines
>>>>>>> that only work with pure text, but we also need to keep track of the
>>>>>>> formatting information related to the processed text. It is in general
>>>>>>> also valuable for us to be able to track every annotation back to the
>>>>>>> original XML to maintain provenance. Before embarking on this I would
>>>>>>> like to validate our approach with more experienced users, since this is
>>>>>>> the first application we are building with UIMA.
>>>>>>> 
>>>>>>> In the first step we would annotate every important element of the XML,
>>>>>>> including formatting elements in the body. We maintain some DOM-like
>>>>>>> relationships between the body text and formatting annotations so that
>>>>>>> text formatting can be reproduced later with NLP annotations in some
>>>>>>> article viewer.
>>>>>>> 
>>>>>>> Next, in another AE, we would produce a pure text view of the text
>>>>>>> annotations in the XML view that need to be NL analysed. In this new
>>>>>>> text view we would annotate the different text elements with references
>>>>>>> back to their counterparts in the original XML view, so that we can
>>>>>>> trace back positions in the original XML and the formatting relations.
>>>>>>> This will of course require mapping NLP annotation offsets in the text
>>>>>>> view back to the XML view, but the information should then be there to
>>>>>>> make this possible.
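
The reverse mapping described above, from the text view back to the XML view, could be kept as a per-character origin table. The Python sketch below shows this under the same naive tag-scanning assumption; `build_text_view` and `to_source_span` are hypothetical names, and a real UIMA pipeline would store the mapping in the CAS rather than in Python lists.

```python
def build_text_view(source):
    """Return (plain_text, origin).

    origin[i] is the offset in `source` of the i-th character of the
    plain text, so every text position can be traced back to the XML.
    """
    plain = []
    origin = []
    in_tag = False
    for index, ch in enumerate(source):
        if ch == "<":
            in_tag = True
        elif ch == ">" and in_tag:
            in_tag = False
        elif not in_tag:
            plain.append(ch)
            origin.append(index)
    return "".join(plain), origin


def to_source_span(begin, end, origin):
    """Map a non-empty text-view span [begin, end) back to source offsets."""
    if not 0 <= begin < end <= len(origin):
        raise ValueError("span must be non-empty and inside the text view")
    return origin[begin], origin[end - 1] + 1


source = "<sec><title>Results</title><p>Lead was <i>elevated</i>.</p></sec>"
plain, origin = build_text_view(source)

# An annotation an NLP engine might create on the text view.
begin = plain.index("elevated")
end = begin + len("elevated")
src_begin, src_end = to_source_span(begin, end, origin)
print(repr(source[src_begin:src_end]))
```

A text-view span that crosses formatting elements maps back to a source span that includes the markup in between, which is what makes it possible to recover the formatting relations afterwards.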
>>>>>>> 
>>>>>>> This approach requires somewhat more handcrafted bookkeeping than we
>>>>>>> initially hoped would be necessary. We haven’t been able to find any
>>>>>>> examples of how this is usually done, and the UIMA docs are vague
>>>>>>> regarding managing these kinds of relationships across views. We would
>>>>>>> therefore really like to know if there is a simpler and better approach.
>>>>>>> 
>>>>>>> Any feedback is greatly appreciated. Thanks.
>>>>>>> 
>>>>>> 
>>> 
>
