[
https://issues.apache.org/jira/browse/UIMA-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011660#comment-13011660
]
Richard Eckart de Castilho commented on UIMA-2101:
--------------------------------------------------
Generally, trying to recover a document including its annotations from inlined
XML will not work, because the inlined XML data model is less expressive than
the CAS data model: inlined XML cannot represent overlapping annotations.
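To make the overlap restriction concrete, here is a minimal sketch of a check for whether a set of spans could be nested as XML elements at all. The class SpanNesting and its Span type are hypothetical helpers for illustration, not part of UIMA; they only mirror the begin/end offsets of annotations.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not a UIMA class): can these spans be nested as XML elements? */
final class SpanNesting {

    /** A [begin, end) character span, mirroring an annotation's begin/end offsets. */
    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    /** True if the spans partially overlap, i.e. they intersect but neither contains the other. */
    static boolean crosses(Span a, Span b) {
        boolean disjoint = a.end <= b.begin || b.end <= a.begin;
        boolean aInsideB = b.begin <= a.begin && a.end <= b.end;
        boolean bInsideA = a.begin <= b.begin && b.end <= a.end;
        return !disjoint && !aInsideB && !bInsideA;
    }

    /** True if every pair of spans is either disjoint or nested, so a tree can hold them. */
    static boolean representableInline(List<Span> spans) {
        for (int i = 0; i < spans.size(); i++) {
            for (int j = i + 1; j < spans.size(); j++) {
                if (crosses(spans.get(i), spans.get(j))) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Span> nested = Arrays.asList(new Span(0, 10), new Span(0, 4), new Span(5, 10));
        List<Span> crossing = Arrays.asList(new Span(0, 5), new Span(3, 8));
        System.out.println("nested spans representable:   " + representableInline(nested));
        System.out.println("crossing spans representable: " + representableInline(crossing));
    }
}
```

The pairwise check is quadratic, which is fine for a sanity test before serialization; it is only meant to show which CAS contents cannot survive a round trip through inlined XML.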
The best option to recover your document from data-oriented XML is to make sure
all your text is covered by an annotation (e.g. Token) which should be a leaf in
the DOM, and to use the offsets of these annotations to reconstruct the original
string. That assumes that there is no text between Tokens, that is, no text
between two opening or two closing XML tags.
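A rough sketch of that reconstruction, assuming the leaf annotations have already been read from the DOM into (begin, end, text) triples. OffsetRebuilder and Leaf are hypothetical names used only for illustration; nothing here is UIMA API. Offsets are treated as UTF-16 code units, which is what Java's String.length() counts.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper (not a UIMA class): rebuilds the document text from leaf annotations. */
final class OffsetRebuilder {

    /** One leaf element, e.g. a Token, with its begin/end attributes and its text content. */
    static final class Leaf {
        final int begin;
        final int end;
        final String text;

        Leaf(int begin, int end, String text) {
            this.begin = begin;
            this.end = end;
            this.text = text;
        }
    }

    /**
     * Concatenates the leaves in offset order. Fails loudly if the leaves do not
     * cover the text exactly, because then the stated assumption does not hold.
     */
    static String rebuild(List<Leaf> leaves) {
        List<Leaf> sorted = new ArrayList<>(leaves);
        sorted.sort(Comparator.comparingInt((Leaf l) -> l.begin));
        StringBuilder out = new StringBuilder();
        for (Leaf leaf : sorted) {
            if (leaf.end - leaf.begin != leaf.text.length()) {
                throw new IllegalArgumentException(
                        "text length does not match offsets at " + leaf.begin);
            }
            if (leaf.begin != out.length()) {
                throw new IllegalArgumentException(
                        "gap or overlap before offset " + leaf.begin);
            }
            out.append(leaf.text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Leaf> leaves = Arrays.asList(
                new Leaf(6, 11, "world"),
                new Leaf(0, 5, "Hello"),
                new Leaf(5, 6, " "));
        System.out.println(rebuild(leaves));
    }
}
```

Note that the whitespace between words has to be covered by a leaf as well; a gap between two leaves raises an exception rather than being silently padded.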
Otherwise formatting really needs to be turned off and serialization should
happen in such a way that, as you say, offsets are preserved. Again this
assumes that there are no overlapping annotations in the CAS. In this case you
need to make sure that you do capture ignorable whitespace when parsing the XML.
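As an illustration of why this matters, the following sketch parses the same one-character document with and without indentation using the standard JAXP DOM parser and collects every text node. WhitespaceProbe is a hypothetical class name, and the element name "a" is made up for brevity. Without a DTD the parser cannot tell that the indentation is ignorable, so it shows up as text and shifts the offsets.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.Text;

/** Hypothetical probe (not a UIMA class): shows the text a DOM parser sees, whitespace included. */
final class WhitespaceProbe {

    /** Parses the XML and returns the concatenation of all text nodes in document order. */
    static String textOf(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // false is the default; set explicitly to make clear that whitespace must be kept.
            factory.setIgnoringElementContentWhitespace(false);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            StringBuilder out = new StringBuilder();
            collect(doc.getDocumentElement(), out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    /** Depth-first walk that appends the data of every text (or CDATA) node. */
    private static void collect(Node node, StringBuilder out) {
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child instanceof Text) {
                out.append(((Text) child).getData());
            } else {
                collect(child, out);
            }
        }
    }

    public static void main(String[] args) {
        String compact = "<Document><a>x</a></Document>";
        String indented = "<Document>\n  <a>x</a>\n</Document>";
        System.out.println("compact length:  " + textOf(compact).length());
        System.out.println("indented length: " + textOf(indented).length());
    }
}
```

With the indented input, the character "x" no longer sits at offset 0 of the collected text, so begin/end attributes written against the original document no longer line up.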
I have tried for a considerable time to implement a system for annotated
corpora based on XML as a data model and arrived at the conclusion that it does
more harm than good. Today I am happy to use the CAS and its XMI serialization
as primary data and serialization models.
> CasToInlineXml adds whitespace
> ------------------------------
>
> Key: UIMA-2101
> URL: https://issues.apache.org/jira/browse/UIMA-2101
> Project: UIMA
> Issue Type: Bug
> Affects Versions: 2.3.1SDK
> Reporter: Steven Bethard
>
> CasToInlineXml adds indentation between adjacent XML elements. E.g. for a
> single character document with a single annotation covering that one
> character, it will write:
> {noformat}
> <?xml version="1.0" encoding="UTF-8"?>
> <Document>
> <uima.tcas.DocumentAnnotation sofa="Sofa" begin="0" end="1"
> language="x-unspecified">
> <uima.tcas.Annotation sofa="Sofa" begin="0" end="1">
> </uima.tcas.Annotation>
> </uima.tcas.DocumentAnnotation>
> </Document>
> {noformat}
> I think it should instead write everything in a single line, that is:
> {noformat}
> <?xml version="1.0" encoding="UTF-8"?>
> <Document><uima.tcas.DocumentAnnotation sofa="Sofa" begin="0" end="1"
> language="x-unspecified"><uima.tcas.Annotation sofa="Sofa" begin="0" end="1">
> </uima.tcas.Annotation></uima.tcas.DocumentAnnotation></Document>
> {noformat}
> I believe this could be fixed by replacing the line:
> {noformat}
> XMLSerializer sax2xml = new XMLSerializer(byteArrayOutputStream);
> {noformat}
> with the line:
> {noformat}
> XMLSerializer sax2xml = new XMLSerializer(byteArrayOutputStream, false);
> {noformat}
> I think it's a bug that CasToInlineXml is changing the character offsets, but
> I would also be happy if there was an alternate constructor or a method on
> CasToInlineXml that allowed disabling the formatting.