[ 
https://issues.apache.org/jira/browse/UIMA-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011660#comment-13011660
 ] 

Richard Eckart de Castilho commented on UIMA-2101:
--------------------------------------------------

Generally, trying to recover a document including its annotations from inlined 
XML will not work, because the inlined XML data model is less expressive than 
the CAS data model: inlined XML cannot represent overlapping annotations.
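
To make the limitation concrete, here is a minimal sketch of a check for two 
annotations that cross each other and therefore cannot both become nested XML 
elements. The class and method names are hypothetical and not part of UIMA; 
offsets are assumed to be half-open [begin, end) character ranges.

```java
/** Hypothetical helper, not part of UIMA. */
class SpanCheck {
    /**
     * Returns true if the two spans partially overlap, i.e. neither contains
     * the other and they are not disjoint. Such spans cannot be written as
     * properly nested XML elements.
     */
    static boolean crosses(int b1, int e1, int b2, int e2) {
        return (b1 < b2 && b2 < e1 && e1 < e2)
            || (b2 < b1 && b1 < e2 && e2 < e1);
    }

    public static void main(String[] args) {
        // "New York City": [0,8) = "New York", [4,13) = "York City"
        System.out.println(crosses(0, 8, 4, 13));  // crossing spans
        System.out.println(crosses(0, 13, 4, 8));  // nested spans
    }
}
```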

The best option to recover your document from data-oriented XML is to make sure 
all your text is covered by an annotation (e.g. Token), which should be a leaf 
in the DOM, and to use the offsets of these annotations to reconstruct the 
original string. That assumes that there is no text between Tokens, that is, no 
text between two opening or two closing XML tags.
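
A minimal sketch of that reconstruction, assuming the leaf annotations have 
already been read from the XML into simple (begin, end, text) triples. The 
Token class and rebuild method are hypothetical and not UIMA API, and filling 
uncovered positions with a space is an assumption of this sketch.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of UIMA. */
class OffsetRebuild {
    /** A leaf annotation: its character offsets and the text it covers. */
    static final class Token {
        final int begin;
        final int end;
        final String text;

        Token(int begin, int end, String text) {
            if (end - begin != text.length()) {
                throw new IllegalArgumentException("offsets do not match text length");
            }
            this.begin = begin;
            this.end = end;
            this.text = text;
        }
    }

    /**
     * Rebuilds a string of the given length by writing each token's text at
     * its begin offset. Positions not covered by any token are filled with a
     * space (an assumption; the original characters there are lost).
     */
    static String rebuild(List<Token> tokens, int length) {
        char[] out = new char[length];
        Arrays.fill(out, ' ');
        for (Token t : tokens) {
            t.text.getChars(0, t.text.length(), out, t.begin);
        }
        return new String(out);
    }

    public static void main(String[] args) {
        List<Token> tokens = Arrays.asList(
            new Token(0, 5, "Hello"),
            new Token(6, 11, "world"));
        System.out.println(rebuild(tokens, 11));
    }
}
```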

Otherwise formatting really needs to be turned off and serialization should 
happen in such a way that, as you say, offsets are preserved. Again this 
assumes that there are no overlapping annotations in the CAS. In this case you 
need to make sure that you do capture ignorable whitespace when parsing the XML.
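
As a rough sketch of what capturing that whitespace can look like, the 
following uses the JDK's SAX parser and collects both regular character data 
and ignorable whitespace, so that every character between tags counts towards 
the offsets. The class and method names are hypothetical; only the 
javax.xml.parsers and org.xml.sax calls are real JDK API.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper, not part of UIMA. */
class InlineXmlText {
    /** Returns all character data inside the root element, whitespace included. */
    static String extractText(String xml) {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void ignorableWhitespace(char[] ch, int start, int length) {
                // Reported separately by validating parsers; keep it as well.
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String compact = "<Document><A>x</A></Document>";
        String indented = "<Document>\n    <A>x</A>\n</Document>";
        // The indented form carries extra characters that shift all offsets.
        System.out.println(extractText(compact).length());
        System.out.println(extractText(indented).length());
    }
}
```

Comparing the two lengths shows how indentation added by the serializer ends 
up in the recovered text unless formatting is turned off.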

I have tried for a considerable time to implement a system for annotated 
corpora based on XML as a data model and arrived at the conclusion that it does 
more harm than good. Today I am happy to use the CAS and its XMI serialization 
as primary data and serialization models.

> CasToInlineXml adds whitespace
> ------------------------------
>
>                 Key: UIMA-2101
>                 URL: https://issues.apache.org/jira/browse/UIMA-2101
>             Project: UIMA
>          Issue Type: Bug
>    Affects Versions: 2.3.1SDK
>            Reporter: Steven Bethard
>
> CasToInlineXml adds indentation between adjacent XML elements. E.g. for a 
> single character document with a single annotation covering that one 
> character, it will write:
> {noformat}
> <?xml version="1.0" encoding="UTF-8"?>
> <Document>
>     <uima.tcas.DocumentAnnotation sofa="Sofa" begin="0" end="1" 
> language="x-unspecified">
>         <uima.tcas.Annotation sofa="Sofa" begin="0" end="1"> 
> </uima.tcas.Annotation>
>     </uima.tcas.DocumentAnnotation>
> </Document>
> {noformat}
> I think it should instead write everything in a single line, that is:
> {noformat}
> <?xml version="1.0" encoding="UTF-8"?>
> <Document><uima.tcas.DocumentAnnotation sofa="Sofa" begin="0" end="1" 
> language="x-unspecified"><uima.tcas.Annotation sofa="Sofa" begin="0" end="1"> 
> </uima.tcas.Annotation></uima.tcas.DocumentAnnotation></Document>
> {noformat}
> I believe this could be fixed by replacing the line:
> {noformat}
> XMLSerializer sax2xml = new XMLSerializer(byteArrayOutputStream);
> {noformat}
> with the line:
> {noformat}
> XMLSerializer sax2xml = new XMLSerializer(byteArrayOutputStream, false);
> {noformat}
> I think it's a bug that CasToInlineXml is changing the character offsets, but 
> I would also be happy if there was an alternate constructor or a method on 
> CasToInlineXml that allowed disabling the formatting.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
