[ https://issues.apache.org/jira/browse/UIMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12493672 ]
Adam Lally commented on UIMA-387:
---------------------------------

>We should make sure we don't generate XMI that we can't read back in. It's
>better to throw an exception or replace characters or whatever
>when generating the XMI than making the user believe everything's fine and
>later they can't get at their results.

I agree that would be better. It is too bad the XML Serialization code we're using (the built-in Xalan XSLT stuff from the JRE) doesn't already have this behavior. I hate to have to implement this by scanning every string for bad characters prior to serialization, that just seems so wasteful. Does that leave us with reimplementing the serialization ourselves?

> XMI Serializer can write invalid control characters
> ---------------------------------------------------
>
>                 Key: UIMA-387
>                 URL: https://issues.apache.org/jira/browse/UIMA-387
>             Project: UIMA
>          Issue Type: Bug
>          Components: Core Java Framework
>    Affects Versions: 2.1
>            Reporter: Adam Lally
>             Fix For: 2.2
>
>
> On 5/1/07, Leo Ferres <[EMAIL PROTECTED]> wrote:
> > Hello,
> >
> > While trying to open an xmi file after processing in xml view, an
> > error pops up telling me that there is an invalid xml character.
> > the error comes from the sax parser. Below is the stack trace. Thanks
> > very much for your help,
> >
> Most control characters are not allowed in XML 1.0, even if they are
> escaped with &#xxx. If your input document contains such characters,
> the XMI CAS serializer is writing them to the output XMI document,
> making it unreadable.
> I checked that if you edit the XMI document and change the first line to:
> <?xml version="1.1" encoding="UTF-8"?>
> The problem goes away, because XML version 1.1 does allow escaped
> control characters.
> So one possibility for us to fix this in UIMA is to have the XMI CAS
> Serializer generate XML version 1.1 tag by default. (I think we
> considered that before and decided not to for some reason, maybe we
> were worried that other applications might not be able to consume XML
> 1.1? I can't remember. :)
> Another possibility would be to have the XMI serializer automatically
> replace these characters with spaces. The XCAS (not XMI) serializer
> does that, but only for the document text, not for feature values. We
> could also serialize the XMI using XML version 1.1, which allows
> escaped control characters (but still not the 0x00 character).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
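
For reference, the "scan every string" and "replace with spaces" options discussed above could look roughly like the sketch below. This is not the UIMA serializer code. The class and method names (XmlCharSanitizer, isValidXml10Char, indexOfInvalidXml10Char, replaceInvalidXml10Chars) are hypothetical and chosen only for illustration. The check follows the Char production of the XML 1.0 specification. The scan is a single linear pass that allocates nothing when the string is clean, which may ease the concern about it being wasteful.

```java
/**
 * Illustrative helper (hypothetical, not part of UIMA): detects or replaces
 * characters that XML 1.0 does not allow, even as character references.
 */
final class XmlCharSanitizer {

    private XmlCharSanitizer() {}

    /** True if the code point matches the XML 1.0 "Char" production. */
    static boolean isValidXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /**
     * Returns the index of the first invalid character, or -1 if the string
     * is safe to write. A serializer could throw an exception when this is
     * non-negative instead of silently writing unreadable output.
     */
    static int indexOfInvalidXml10Char(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!isValidXml10Char(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    /** Replaces each invalid character with a single space. */
    static String replaceInvalidXml10Chars(String s) {
        int first = indexOfInvalidXml10Char(s);
        if (first < 0) {
            return s; // common case: nothing to fix, no copy made
        }
        StringBuilder sb = new StringBuilder(s.length());
        sb.append(s, 0, first);
        for (int i = first; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (isValidXml10Char(cp)) {
                sb.appendCodePoint(cp);
            } else {
                sb.append(' '); // unpaired surrogates also end up here
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }
}
```

A caller would pass each feature value through replaceInvalidXml10Chars (or check indexOfInvalidXml10Char and throw) before handing it to the serializer. Switching the output declaration to version="1.1" would need a different check, since XML 1.1 permits most control characters as character references but still excludes 0x00.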