[ https://issues.apache.org/jira/browse/UIMA-4286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14365980#comment-14365980 ]
Peter Klügl commented on UIMA-4286:
-----------------------------------

I remember that I used the HtmlConverter for converting TEI documents and solved the missing-gap problem between XML tags with a combination of existing configuration parameters. However, I have to investigate which ones those were.

> Ruta: HTMLConverter: Option to convert tags outside body tags
> -------------------------------------------------------------
>
>                 Key: UIMA-4286
>                 URL: https://issues.apache.org/jira/browse/UIMA-4286
>             Project: UIMA
>          Issue Type: Improvement
>          Components: ruta
>    Affects Versions: 2.2.1ruta
>            Reporter: Mario Juric
>            Assignee: Peter Klügl
>             Fix For: 2.3.0ruta
>
>
> The HTML converter only converts tags that are found inside the body tag.
> Therefore some information-carrying tags like citations get left out when
> applying the converter to XML articles with much metadata. It would be useful
> to add the option to have all tags converted since this would allow content
> outside the body to be parsed by natural language analysers as well.
> The converter was originally, as the name implies, conceived for HTML
> documents but together with the HTML Annotator it can this way be more
> generally useful in enabling NL parsing of a broader class of documents such
> as articles stored in XML documents.
> An example of how this option might work can be given by disabling the
> "inBody"-flag inside the HTMLConverterVisitor. The example also illustrates
> what offsets to apply to such annotations but otherwise the document
> annotation offsets can be used. Empty tags can still be ignored but tags with
> only attributes and no content should preferably be converted.
> Experiments with disabling the "in body"-constraint reveal that there will
> be an additional need to separate the content metadata tags in the converted
> text view. An NL parser reading the text will in many cases read different
> tags as one word or one sentence, which is not desirable. Some text delimiter
> should therefore be inserted between tags where required, which optionally
> could be customizable as well.
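
To make the delimiter problem described in the issue concrete, here is a minimal standalone sketch. It does not use Ruta or the HtmlConverter at all; it relies only on Python's standard `html.parser`, and the names `TagTextExtractor`, `convert`, and the `delimiter` argument are hypothetical, chosen for illustration. It collects text from every element (not only those under the body tag) and shows how adjacent tags run together unless a delimiter is inserted between them.

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collects text from every element, not only those inside <body>."""

    def __init__(self, delimiter="\n"):
        super().__init__()
        # Hypothetical option for this sketch; not an actual Ruta parameter.
        self.delimiter = delimiter
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only runs between tags
            self.chunks.append(text)

    def text(self):
        # Join the text of consecutive elements with the chosen delimiter.
        return self.delimiter.join(self.chunks)


def convert(markup, delimiter="\n"):
    parser = TagTextExtractor(delimiter)
    parser.feed(markup)
    parser.close()  # flush any buffered trailing text
    return parser.text()


sample = (
    "<article><front><journal>Nature</journal><author>J. Doe</author></front>"
    "<body><p>Body text.</p></body></article>"
)

print(repr(convert(sample, "")))    # metadata and body text fused together
print(repr(convert(sample, "\n")))  # one element's text per line
```

With an empty delimiter the journal name, author, and body text fuse into a single token sequence, which is the behaviour the issue reports for NL parsers; a newline (or any configurable separator) keeps them apart.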