Adam Lally (JIRA) wrote:
> [ https://issues.apache.org/jira/browse/UIMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Adam Lally reopened UIMA-387:
> -----------------------------
>
> I don't agree with this fix. By putting the check in the XmiCasSerializer,
> we're preventing anyone from using XML 1.1, which I don't think is good. I'd
> rather put the check in the org.apache.uima.util.XMLSerializer class, and
> make it conditional on the output XML version being set to 1.0. (Also fixing
> "closer" to the actual XML generation code makes sense to me since I consider
> this to be a bug in Xalan that we are working around.)

Sure, whatever.

> Also I think surrogates are not implemented properly. I saw your comment
> that says "So it actually looks as if the surrogate case can be handled
> correctly by just looking at individual Java chars," but I don't think that
> is correct.

You're right, I mentally switched to exclude mode when going to the
#x10000-#x10FFFF range. So I'll fix this and then assign the issue to you so
you can move the check to wherever you can agree with.

> The XML spec disallows characters such as &#xD800; (a high surrogate). But
> that doesn't mean you can exclude any Java char that is 0xD800. If it is
> followed by a low surrogate, then the XML serializer will convert this to a
> valid XML character in the &#x10000; - &#x10FFFF; range. At least, the current
> version of Xalan does. Depending on which version is bundled with whatever JVM
> someone's using, it might have a bug and not do it right. But as implemented
> now the XMISerializer doesn't allow surrogate pairs at all.
>
> Finally, in addition to having the serializer catch invalid characters, I
> think we still need to do more to address the issue, perhaps making the
> XmiWriter cas consumer capable of using XML 1.1, maybe even by default.
>
>> XMI Serializer can write invalid control characters
>> ---------------------------------------------------
>>
>> Key: UIMA-387
>> URL: https://issues.apache.org/jira/browse/UIMA-387
>> Project: UIMA
>> Issue Type: Bug
>> Components: Core Java Framework
>> Affects Versions: 2.1
>> Reporter: Adam Lally
>> Assignee: Adam Lally
>> Fix For: 2.2
>>
>>
>> On 5/1/07, Leo Ferres <[EMAIL PROTECTED]> wrote:
>>> Hello,
>>>
>>> While trying to open an xmi file after processing in xml view, an
>>> error pops up telling me that there is an invalid xml character.
>>> the error comes from the sax parser. Below is the stack trace. Thanks
>>> very much for your help,
>>>
>> Most control characters are not allowed in XML 1.0, even if they are
>> escaped with &#xxx.
>> If your input document contains such characters,
>> the XMI CAS serializer is writing them to the output XMI document,
>> making it unreadable.
>> I checked that if you edit the XMI document and change the first line to:
>> <?xml version="1.1" encoding="UTF-8"?>
>> The problem goes away, because XML version 1.1 does allow escaped
>> control characters.
>> So one possibility for us to fix this in UIMA is to have the XMI CAS
>> Serializer generate XML version 1.1 tag by default. (I think we
>> considered that before and decided not to for some reason, maybe we
>> were worried that other applications might not be able to consume XML
>> 1.1? I can't remember. :)
>> Another possibility would be to have the XMI serializer automatically
>> replace these characters with spaces. The XCAS (not XMI) serializer
>> does that, but only for the document text, not for feature values. We
>> could also serialize the XMI using XML version 1.1, which allows
>> escaped control characters (but still not the 0x00 character).
>
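
For reference, here is a minimal sketch of the kind of check discussed above. It is not the code in XmiCasSerializer or XMLSerializer; the class and method names (Xml10CharCheck, isValidXml10, firstInvalidXml10Char) are made up for illustration. It walks the string by code point rather than by individual Java char, so a high surrogate followed by a low surrogate is treated as one supplementary character (valid in XML 1.0), while a lone surrogate or a control character is reported as invalid.

```java
/** Hypothetical helper, not part of UIMA: finds characters XML 1.0 cannot represent. */
final class Xml10CharCheck {

    private Xml10CharCheck() {}

    /** True if the code point matches the XML 1.0 "Char" production. */
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /**
     * Returns the char index of the first character that is not valid in
     * XML 1.0, or -1 if the whole string is valid.
     */
    static int firstInvalidXml10Char(String s) {
        int i = 0;
        while (i < s.length()) {
            // codePointAt combines a proper surrogate pair into one code point;
            // an unpaired surrogate comes back as itself and fails the check.
            int cp = s.codePointAt(i);
            if (!isValidXml10(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }
}
```

A serializer could run something like this only when the output version is 1.0, and either throw or substitute a space at the returned index; which of those is preferable is the open question in this thread.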