[ 
https://issues.apache.org/jira/browse/UIMA-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011643#comment-13011643
 ] 

Richard Eckart de Castilho commented on UIMA-2101:
--------------------------------------------------

Actually, that depends on how you treat the XML data: document-oriented or
data-oriented. In data-oriented XML, whitespace between two opening tags or
between two closing tags is so-called "ignorable whitespace" and may be added
or omitted for the sake of readability. Only whitespace between an opening and
a closing tag needs to be preserved. The SAX ContentHandler interface reflects
this distinction with two different methods for receiving whitespace:
characters() and ignorableWhitespace().
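
To illustrate the two SAX callbacks, here is a minimal sketch that records what
arrives through each of them. The class name WhitespaceCallbacks and the tiny
doc/item DTD are made up for this example. The sketch assumes a validating
parser, because a parser can only classify whitespace as ignorable when it
knows from a DTD that an element has element-only content; without that
information the same whitespace is normally delivered through characters().

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WhitespaceCallbacks {

    /** Records what the parser delivers through each of the two callbacks. */
    static class Recorder extends DefaultHandler {
        final StringBuilder text = new StringBuilder();
        int ignorableChars = 0;

        @Override
        public void characters(char[] ch, int start, int length) {
            // Character data, including whitespace the parser considers significant.
            text.append(ch, start, length);
        }

        @Override
        public void ignorableWhitespace(char[] ch, int start, int length) {
            // Whitespace inside element-only content, as declared by the DTD.
            ignorableChars += length;
        }
    }

    static Recorder parse(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true); // assumption: validate against the inline DTD
            Recorder recorder = new Recorder();
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), recorder);
            return recorder;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical document: "doc" has element-only content, "item" holds text.
        String xml =
            "<?xml version=\"1.0\"?>\n"
            + "<!DOCTYPE doc [\n"
            + "  <!ELEMENT doc (item)>\n"
            + "  <!ELEMENT item (#PCDATA)>\n"
            + "]>\n"
            + "<doc>\n    <item> </item>\n</doc>";
        Recorder r = parse(xml);
        System.out.println("ignorable chars: " + r.ignorableChars);
        System.out.println("text: [" + r.text + "]");
    }
}
```

The indentation around the item element should arrive through
ignorableWhitespace(), while the single space inside item arrives through
characters().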

Thus, whether a round trip preserves the content depends on what you had in
mind when you implemented your parser and serializer. It looks like UIMA has
data-oriented XML in mind when serializing, so you should only need to respect
that convention when parsing the output again.
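
As a rough sketch of what respecting that on the parsing side could look like,
the handler below drops whitespace-only text that sits between two opening tags
or between two closing tags and keeps everything else. This is not UIMA code:
the class DataOrientedTextReader and its readText method are hypothetical, and
the rule is a literal reading of the description above. In particular, it is an
assumption of this sketch that whitespace between a closing tag and the next
opening tag is kept, since in inline annotations that may well be document text.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class DataOrientedTextReader extends DefaultHandler {
    private enum Event { NONE, START, END }

    private final StringBuilder text = new StringBuilder();
    private final StringBuilder pending = new StringBuilder();
    private Event previous = Event.NONE;

    @Override
    public void characters(char[] ch, int start, int length) {
        // Buffer the text; whether to keep it is decided at the next tag.
        pending.append(ch, start, length);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        flush(previous == Event.START); // text sits between two opening tags
        previous = Event.START;
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        flush(previous == Event.END); // text sits between two closing tags
        previous = Event.END;
    }

    private void flush(boolean betweenLikeTags) {
        boolean whitespaceOnly = pending.toString().trim().isEmpty();
        if (!(betweenLikeTags && whitespaceOnly)) {
            text.append(pending);
        }
        pending.setLength(0);
    }

    /** Returns the document text with the ignorable whitespace removed. */
    static String readText(String xml) {
        try {
            DataOrientedTextReader reader = new DataOrientedTextReader();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), reader);
            return reader.text.toString();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String indented = "<Document>\n    <a begin=\"0\" end=\"1\"> </a>\n</Document>";
        String compact = "<Document><a begin=\"0\" end=\"1\"> </a></Document>";
        // Both forms should yield the same single-character text.
        System.out.println(readText(indented).equals(readText(compact)));
    }
}
```

With a reader along these lines, the indented and the single-line output quoted
below would yield the same text, so the character offsets would survive the
round trip for this case.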


> CasToInlineXml adds whitespace
> ------------------------------
>
>                 Key: UIMA-2101
>                 URL: https://issues.apache.org/jira/browse/UIMA-2101
>             Project: UIMA
>          Issue Type: Bug
>    Affects Versions: 2.3.1SDK
>            Reporter: Steven Bethard
>
> CasToInlineXml adds indentation between adjacent XML elements. E.g. for a 
> single character document with a single annotation covering that one 
> character, it will write:
> {noformat}
> <?xml version="1.0" encoding="UTF-8"?>
> <Document>
>     <uima.tcas.DocumentAnnotation sofa="Sofa" begin="0" end="1" language="x-unspecified">
>         <uima.tcas.Annotation sofa="Sofa" begin="0" end="1"> </uima.tcas.Annotation>
>     </uima.tcas.DocumentAnnotation>
> </Document>
> {noformat}
> I think it should instead write everything in a single line, that is:
> {noformat}
> <?xml version="1.0" encoding="UTF-8"?>
> <Document><uima.tcas.DocumentAnnotation sofa="Sofa" begin="0" end="1" language="x-unspecified"><uima.tcas.Annotation sofa="Sofa" begin="0" end="1"> </uima.tcas.Annotation></uima.tcas.DocumentAnnotation></Document>
> {noformat}
> I believe this could be fixed by replacing the line:
> {noformat}
> XMLSerializer sax2xml = new XMLSerializer(byteArrayOutputStream);
> {noformat}
> with the line:
> {noformat}
> XMLSerializer sax2xml = new XMLSerializer(byteArrayOutputStream, false);
> {noformat}
> I think it's a bug that CasToInlineXml is changing the character offsets, but 
> I would also be happy if there was an alternate constructor or a method on 
> CasToInlineXml that allowed disabling the formatting.

