[ https://issues.apache.org/jira/browse/UIMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492934 ]
Marshall Schor commented on UIMA-387:
-------------------------------------

I don't think we should (silently) change user data (i.e., replacing funny characters with spaces). I would prefer the XML 1.1 approach, unless someone has a reason 1.0 is needed.

That still leaves the 0x00 character not being valid. Could we output something that is valid XML but that our deserializer could convert back to 0x00 when reading it in? I suppose if we came up with such a mechanism, it could be used in XML 1.0 for all the "bad" characters. Maybe something like outputting a special XML element we define, which has a hex representation of the bad character(s)?

How does EMF handle this?

-Marshall

> XMI Serializer can write invalid control characters
> ---------------------------------------------------
>
>                 Key: UIMA-387
>                 URL: https://issues.apache.org/jira/browse/UIMA-387
>             Project: UIMA
>          Issue Type: Bug
>          Components: Core Java Framework
>    Affects Versions: 2.1
>            Reporter: Adam Lally
>             Fix For: 2.2
>
>
> On 5/1/07, Leo Ferres <[EMAIL PROTECTED]> wrote:
> > Hello,
> >
> > While trying to open an xmi file after processing in xml view, an
> > error pops up telling me that there is an invalid xml character.
> > the error comes from the sax parser. Below is the stack trace. Thanks
> > very much for your help,
> >
> Most control characters are not allowed in XML 1.0, even if they are
> escaped with &#xxx. If your input document contains such characters,
> the XMI CAS serializer is writing them to the output XMI document,
> making it unreadable.
> I checked that if you edit the XMI document and change the first line to:
> <?xml version="1.1" encoding="UTF-8"?>
> The problem goes away, because XML version 1.1 does allow escaped
> control characters.
> So one possibility for us to fix this in UIMA is to have the XMI CAS
> Serializer generate XML version 1.1 tag by default. (I think we
> considered that before and decided not to for some reason, maybe we
> were worried that other applications might not be able to consume XML
> 1.1? I can't remember. :)
> Another possibility would be to have the XMI serializer automatically
> replace these characters with spaces. The XCAS (not XMI) serializer
> does that, but only for the document text, not for feature values. We
> could also serialize the XMI using XML version 1.1, which allows
> escaped control characters (but still not the 0x00 character).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
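
To make the character ranges discussed in this issue concrete, here is a minimal Java sketch. It is not the UIMA serializer code: the class name XmlCharCheck and all of its methods are hypothetical and exist only for illustration. The ranges follow the Char productions of the XML 1.0 and XML 1.1 specifications. The sketch shows that 0x00 is rejected by both versions, and what the replace-with-spaces option would do to user data.

```java
// Hypothetical helper for illustration only; not part of UIMA.
final class XmlCharCheck {

    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /**
     * True if the code point matches the XML 1.1 Char production.
     * Most control characters are allowed here (as character
     * references), but 0x0 is still excluded.
     */
    static boolean isValidXml11(int cp) {
        return (cp >= 0x1 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Index of the first char that XML 1.0 rejects, or -1 if none. */
    static int firstInvalidXml10(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!isValidXml10(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    /**
     * The lossy option: replace every character that XML 1.0 rejects
     * with a space. The original value cannot be recovered afterwards.
     */
    static String replaceInvalidXml10(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            sb.appendCodePoint(isValidXml10(cp) ? cp : ' ');
            i += Character.charCount(cp);
        }
        return sb.toString();
    }
}
```

Under these assumptions, a feature value containing 0x01 fails the XML 1.0 check but passes the XML 1.1 check, while a value containing 0x00 fails both, so any lossless fix would still need a separate encoding for that character.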