[ 
https://issues.apache.org/jira/browse/UIMA-4286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378337#comment-14378337
 ] 

Mario Juric commented on UIMA-4286:
-----------------------------------

Some extra remarks on the previous comment:
2) It is inevitable that the gap text is included in the offsets when a tag has 
more than one content element in its subtree. The gap text is not included when 
I look at a single element only, which is what I want.
5) After some thinking, I don't feel that the use case for using a map 
instead of a list in HtmlConverter.PARAM_GAP_INDUCING_TAGS is very strong. We 
wouldn't actually run any NLP on metadata such as author names and the like. 
This is metadata that we would just extract from the annotations. The 
most important elements I can think of are title, abstract, body and citation 
titles, which seem to be segmented into sentences as they should be. The rest is 
just metadata, which we would extract separately, so it doesn't really matter 
if the offsets or the sentence segmentation appear strange in these parts.
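
To illustrate point 2, here is a minimal sketch in plain Python, not the 
Ruta/UIMA API. The names convert and covering_span, the sample elements and 
the newline used as gap text are all assumptions made for illustration; the 
actual HtmlConverter may compute offsets differently.

```python
from typing import List, Set, Tuple

Span = Tuple[str, int, int]  # (tag, begin, end) in the converted text


def convert(elements: List[Tuple[str, str]],
            gap_tags: Set[str],
            gap_text: str = "\n") -> Tuple[str, List[Span]]:
    """Concatenate element texts, appending gap_text after gap-inducing tags.

    Hypothetical helper: each leaf element's span covers only its own text,
    never the gap text that follows it.
    """
    parts: List[str] = []
    spans: List[Span] = []
    pos = 0
    for tag, content in elements:
        begin = pos
        parts.append(content)
        pos += len(content)
        spans.append((tag, begin, pos))
        if tag in gap_tags:
            parts.append(gap_text)
            pos += len(gap_text)
    return "".join(parts), spans


def covering_span(spans: List[Span]) -> Tuple[int, int]:
    """Offsets of a parent tag whose subtree holds all the given elements."""
    return spans[0][1], spans[-1][2]


# Assumed sample input: two content elements under one parent tag.
elements = [("title", "Deep Parsing"), ("abstract", "We parse XML.")]
text, spans = convert(elements, gap_tags={"title", "abstract"})

print(spans)                 # [('title', 0, 12), ('abstract', 13, 26)]
print(covering_span(spans))  # (0, 26)
```

The single title element covers offsets 0 to 12 and contains no gap text, 
while the parent span from 0 to 26 necessarily includes the newline inserted 
between title and abstract.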

> Ruta: HTMLConverter: Option to convert tags outside body tags
> -------------------------------------------------------------
>
>                 Key: UIMA-4286
>                 URL: https://issues.apache.org/jira/browse/UIMA-4286
>             Project: UIMA
>          Issue Type: Improvement
>          Components: ruta
>    Affects Versions: 2.2.1ruta
>            Reporter: Mario Juric
>            Assignee: Peter Klügl
>             Fix For: 2.3.0ruta
>
>
> The HTML converter only converts tags that are found inside the body tag. 
> Therefore some information-carrying tags like citations get left out when 
> applying the converter to XML articles with a lot of metadata. It would be useful 
> to add the option to have all tags converted since this would allow content 
> outside the body to be parsed by natural language analysers as well.
> The converter was originally, as the name implies, conceived for HTML 
> documents but together with the HTML Annotator it can this way be more 
> generally useful in enabling NL parsing of a broader class of documents such 
> as articles stored in XML documents.
> An example of how this option might work can be given by disabling the 
> "inBody"-flag inside the HTMLConverterVisitor. The example also illustrates 
> what offsets to apply to such annotations but otherwise the document 
> annotation offsets can be used. Empty tags can still be ignored but tags with 
> only attributes and no content should preferably be converted.
> Experiments with disabling the "in body"-constraint reveal that there will 
> be an additional need to separate the content of the metadata tags in the 
> converted text view. An NL parser reading the text will in many cases read 
> different tags as one word or one sentence, which is not desirable. Some text 
> delimiter should therefore be inserted between tags where required, which 
> optionally could be customizable as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
