[ 
https://issues.apache.org/jira/browse/UIMA-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011642#comment-13011642
 ] 

Steven Bethard commented on UIMA-2101:
--------------------------------------

Basically, I think you should be able to round-trip: write to XML, read it 
back again, and get the same document text. As it currently stands, the extra 
whitespace means that if you read the XML back in, you won't get the original 
document text set on your Sofa; you'll get a whitespace-mangled version of it.

In the example above, the original DocumentAnnotation contained only a single 
space. After XML conversion, the DocumentAnnotation contains a newline and nine 
spaces. And there's no way to figure out which of those spaces were in the 
original text and which were added by XML conversion.
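
To make the round-trip problem concrete, here is a minimal sketch that uses 
only the JDK's DOM parser, not UIMA itself. It parses the two XML forms from 
the issue description below and compares the character data a reader would 
recover. The class name WhitespaceRoundTrip and the helper textOf are made up 
for illustration, and the line-wrapped example XML is assumed to have been 
on single lines originally.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

// Hypothetical demo class; it does not call CasToInlineXml or any UIMA API.
class WhitespaceRoundTrip {
    // Indented form, as in the first example of the issue description.
    static final String INDENTED =
        "<Document>\n"
        + "    <uima.tcas.DocumentAnnotation sofa=\"Sofa\" begin=\"0\" end=\"1\" language=\"x-unspecified\">\n"
        + "        <uima.tcas.Annotation sofa=\"Sofa\" begin=\"0\" end=\"1\"> </uima.tcas.Annotation>\n"
        + "    </uima.tcas.DocumentAnnotation>\n"
        + "</Document>";

    // Single-line form, as proposed in the issue description.
    static final String COMPACT =
        "<Document>"
        + "<uima.tcas.DocumentAnnotation sofa=\"Sofa\" begin=\"0\" end=\"1\" language=\"x-unspecified\">"
        + "<uima.tcas.Annotation sofa=\"Sofa\" begin=\"0\" end=\"1\"> </uima.tcas.Annotation>"
        + "</uima.tcas.DocumentAnnotation>"
        + "</Document>";

    // Parses the XML and returns all character data under the root element,
    // i.e. what a reader of the inline XML would take to be the document text.
    static String textOf(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement()
                    .getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("compact text length:  " + textOf(COMPACT).length());
        System.out.println("indented text length: " + textOf(INDENTED).length());
        System.out.println("same text: " + textOf(COMPACT).equals(textOf(INDENTED)));
    }
}
```

Without a DTD or schema, the parser has no way to tell indentation from 
content, so it has to report the indentation as text. That is why the original 
offsets cannot be recovered from the indented form.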

> CasToInlineXml adds whitespace
> ------------------------------
>
>                 Key: UIMA-2101
>                 URL: https://issues.apache.org/jira/browse/UIMA-2101
>             Project: UIMA
>          Issue Type: Bug
>    Affects Versions: 2.3.1SDK
>            Reporter: Steven Bethard
>
> CasToInlineXml adds indentation between adjacent XML elements. E.g. for a 
> single character document with a single annotation covering that one 
> character, it will write:
> {noformat}
> <?xml version="1.0" encoding="UTF-8"?>
> <Document>
>     <uima.tcas.DocumentAnnotation sofa="Sofa" begin="0" end="1" language="x-unspecified">
>         <uima.tcas.Annotation sofa="Sofa" begin="0" end="1"> </uima.tcas.Annotation>
>     </uima.tcas.DocumentAnnotation>
> </Document>
> {noformat}
> I think it should instead write everything in a single line, that is:
> {noformat}
> <?xml version="1.0" encoding="UTF-8"?>
> <Document><uima.tcas.DocumentAnnotation sofa="Sofa" begin="0" end="1" language="x-unspecified"><uima.tcas.Annotation sofa="Sofa" begin="0" end="1"> </uima.tcas.Annotation></uima.tcas.DocumentAnnotation></Document>
> {noformat}
> I believe this could be fixed by replacing the line:
> {noformat}
> XMLSerializer sax2xml = new XMLSerializer(byteArrayOutputStream);
> {noformat}
> with the line:
> {noformat}
> XMLSerializer sax2xml = new XMLSerializer(byteArrayOutputStream, false);
> {noformat}
> I think it's a bug that CasToInlineXml changes the character offsets, but 
> I would also be happy if there were an alternate constructor or a method on 
> CasToInlineXml that allowed disabling the formatting.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
