[ https://issues.apache.org/jira/browse/UIMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12501248 ]
Thilo Goetz commented on UIMA-387:
----------------------------------

Adam, I have fixed the incorrect implementation of the XML 1.0 valid character checking. Please apply wherever you want.

> XMI Serializer can write invalid control characters
> ---------------------------------------------------
>
>                 Key: UIMA-387
>                 URL: https://issues.apache.org/jira/browse/UIMA-387
>             Project: UIMA
>          Issue Type: Bug
>          Components: Core Java Framework
>    Affects Versions: 2.1
>            Reporter: Adam Lally
>            Assignee: Adam Lally
>             Fix For: 2.2
>
>
> On 5/1/07, Leo Ferres <[EMAIL PROTECTED]> wrote:
> > Hello,
> >
> > While trying to open an xmi file after processing in xml view, an
> > error pops up telling me that there is an invalid xml character.
> > the error comes from the sax parser. Below is the stack trace. Thanks
> > very much for your help,
> >
> Most control characters are not allowed in XML 1.0, even if they are
> escaped with &#xxx. If your input document contains such characters,
> the XMI CAS serializer is writing them to the output XMI document,
> making it unreadable.
> I checked that if you edit the XMI document and change the first line to:
> <?xml version="1.1" encoding="UTF-8"?>
> The problem goes away, because XML version 1.1 does allow escaped
> control characters.
> So one possibility for us to fix this in UIMA is to have the XMI CAS
> Serializer generate XML version 1.1 tag by default. (I think we
> considered that before and decided not to for some reason, maybe we
> were worried that other applications might not be able to consume XML
> 1.1? I can't remember. :)
> Another possibility would be to have the XMI serializer automatically
> replace these characters with spaces. The XCAS (not XMI) serializer
> does that, but only for the document text, not for feature values. We
> could also serialize the XMI using XML version 1.1, which allows
> escaped control characters (but still not the 0x00 character).

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
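
For readers following the issue, here is a minimal sketch of what an XML 1.0 valid character check and the "replace with spaces" option could look like. It is not the code committed for UIMA-387. The class name Xml10Chars and both method names are hypothetical and are not part of the UIMA API. The ranges come from the Char production of the XML 1.0 specification.

```java
// Sketch only: Xml10Chars and its methods are hypothetical names, not UIMA API.
final class Xml10Chars {

    private Xml10Chars() {}

    /**
     * Returns true if the code point matches the XML 1.0 "Char" production:
     * #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
     */
    static boolean isValidXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Replaces every code point that XML 1.0 forbids with a single space. */
    static String replaceInvalidXml10Chars(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            // codePointAt returns the bare surrogate value for an unpaired
            // surrogate, which falls outside the valid ranges and is replaced.
            int cp = s.codePointAt(i);
            out.appendCodePoint(isValidXml10Char(cp) ? cp : ' ');
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Because every code point that XML 1.0 forbids occupies exactly one Java char, replacing it with one space leaves the string length unchanged in this sketch, so character offsets into the text would remain valid. Whether that matters for feature values as well as document text is a design question the issue leaves open.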