[jira] [Commented] (UIMA-4286) Ruta: HTMLConverter: Option to convert tags outside body tags

JIRA Tue, 17 Mar 2015 13:20:12 -0700

    [ 
https://issues.apache.org/jira/browse/UIMA-4286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14365980#comment-14365980
 ]


Peter Klügl commented on UIMA-4286:
-----------------------------------

I remember that I used the HtmlConverter for converting TEI documents and 
solved the problem of missing gaps between XML tags with a combination of 
existing configuration parameters. However, I have to investigate which ones 
those were. 

> Ruta: HTMLConverter: Option to convert tags outside body tags
> -------------------------------------------------------------
>
>                 Key: UIMA-4286
>                 URL: https://issues.apache.org/jira/browse/UIMA-4286
>             Project: UIMA
>          Issue Type: Improvement
>          Components: ruta
>    Affects Versions: 2.2.1ruta
>            Reporter: Mario Juric
>            Assignee: Peter Klügl
>             Fix For: 2.3.0ruta
>
>
> The HTML converter only converts tags that are found inside the body tag. 
> Therefore some information-carrying tags, such as citations, get left out 
> when applying the converter to XML articles with a lot of metadata. It would 
> be useful to add an option to have all tags converted, since this would allow 
> content outside the body to be parsed by natural language analysers as well.
> The converter was originally, as the name implies, conceived for HTML 
> documents, but with this option and together with the HTML Annotator it can 
> be more generally useful in enabling NL parsing of a broader class of 
> documents, such as articles stored in XML documents.
> An example of how this option might work can be given by disabling the 
> "inBody"-flag inside the HTMLConverterVisitor. The example also illustrates 
> what offsets to apply to such annotations but otherwise the document 
> annotation offsets can be used. Empty tags can still be ignored but tags with 
> only attributes and no content should preferably be converted.
> Experiments with disabling the "in body"-constraint reveal that there will 
> be an additional need to separate the content of metadata tags in the 
> converted text view. An NL parser reading the text will in many cases read 
> the content of different tags as one word or one sentence, which is not 
> desirable. Some text delimiter should therefore be inserted between tags 
> where required, which optionally could be customizable as well.
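
The behaviour requested above can be illustrated with a small standalone 
sketch. It is not the Ruta HTMLConverter or its HTMLConverterVisitor; it uses 
Python's standard html.parser module, and the names TextExtractor, convert, 
only_body and delimiter are hypothetical, chosen only for this illustration. 
It shows the effect of dropping the "in body" constraint and of inserting a 
configurable delimiter between the text of adjacent tags. Annotation offsets 
and tags that carry only attributes are not covered.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content, optionally including text outside <body>.

    Hypothetical illustration only; not the Ruta HTMLConverterVisitor.
    """

    def __init__(self, only_body=True, delimiter="\n"):
        super().__init__()
        self.only_body = only_body    # mimics the "inBody" constraint
        self.delimiter = delimiter    # text inserted between tag contents
        self.in_body = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # skip whitespace-only content
        if self.in_body or not self.only_body:
            self.chunks.append(text)

    def text(self):
        return self.delimiter.join(self.chunks)


def convert(markup, only_body=True, delimiter="\n"):
    """Return the plain text of markup, one chunk per text node."""
    parser = TextExtractor(only_body=only_body, delimiter=delimiter)
    parser.feed(markup)
    parser.close()
    return parser.text()
```

With only_body=True only the text inside the body element is kept. With 
only_body=False and an empty delimiter, the contents of neighbouring metadata 
tags run together into one token, which is the problem described in the last 
paragraph of the issue; a non-empty delimiter keeps them apart.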



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
